## Taxi Orders in NYC

**Importing the dataset from a CSV file in the working directory**

In [4]:
import pandas as pd


taxi = pd.read_csv('2_taxi_nyc.csv')

taxi

,pickup_dt,pickup_month,borough,pickups,hday,spd,vsb,temp,dewp,slp,pcp 01,pcp 06,pcp 24,sd
0,2015-01-01 01:00:00,Jan,Bronx,152,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
1,2015-01-01 01:00:00,Jan,Brooklyn,1519,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
2,2015-01-01 01:00:00,Jan,EWR,0,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
3,2015-01-01 01:00:00,Jan,Manhattan,5258,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
4,2015-01-01 01:00:00,Jan,Queens,405,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29096,2015-06-30 23:00:00,Jun,EWR,0,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
29097,2015-06-30 23:00:00,Jun,Manhattan,3828,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
29098,2015-06-30 23:00:00,Jun,Queens,580,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
29099,2015-06-30 23:00:00,Jun,Staten Island,0,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0


**Checking number of rows and columns in the dataset**

In [5]:
taxi.shape

(29101, 14)

**Checking column names**

In [6]:
taxi.columns

Index(['pickup_dt', 'pickup_month', 'borough', 'pickups', 'hday', 'spd', 'vsb',
       'temp', 'dewp', 'slp', 'pcp 01', 'pcp 06', 'pcp 24', 'sd'],
      dtype='object')

**Columns description:**    
* pickup_dt – pickup interval  
* pickup_month – month  
* borough – borough where the order was placed  
* pickups – number of orders within a one-hour period  
* hday – whether the day was a holiday; Y – yes, N – no  
* spd – wind speed, mph  
* vsb – visibility 
* temp – temperature, F  
* dewp – dewpoint, F  
* slp – atmospheric pressure  
* pcp_01 – precipitation over 1 hour  
* pcp_06 – precipitation over 6 hours  
* pcp_24 – precipitation over 24 hours  
* sd – snow depth  

**Checking column data types**

In [7]:
taxi.dtypes

pickup_dt        object
pickup_month     object
borough          object
pickups           int64
hday             object
spd             float64
vsb             float64
temp            float64
dewp            float64
slp             float64
pcp 01          float64
pcp 06          float64
pcp 24          float64
sd              float64
dtype: object
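
The `pickup_dt` column is loaded as `object`, meaning the timestamps are plain strings. A minimal sketch of converting them to datetimes, assuming the values follow the `YYYY-MM-DD HH:MM:SS` format shown in the table above. The `sample` frame and the `hour` column are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import pandas as pd

# Toy sample: two timestamps and pickup counts copied from the table above,
# not the full dataset.
sample = pd.DataFrame({
    'pickup_dt': ['2015-01-01 01:00:00', '2015-06-30 23:00:00'],
    'pickups': [152, 580],
})

# Convert the string column to datetime64 so the .dt accessor becomes available.
sample['pickup_dt'] = pd.to_datetime(sample['pickup_dt'])

# Hypothetical helper column: hour of day extracted from each timestamp.
sample['hour'] = sample['pickup_dt'].dt.hour

print(sample.dtypes)
```

With a datetime column, grouping by hour or weekday no longer requires string slicing.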

**Finding the most frequent data type**

In [8]:
taxi.dtypes.value_counts()

float64    9
object     4
int64      1
dtype: int64

**Replacing whitespace in the column names with underscores**

In [9]:
def underscore_rename(name):
    return name.replace(' ', '_')

taxi = taxi.rename(columns=underscore_rename)

taxi

,pickup_dt,pickup_month,borough,pickups,hday,spd,vsb,temp,dewp,slp,pcp_01,pcp_06,pcp_24,sd
0,2015-01-01 01:00:00,Jan,Bronx,152,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
1,2015-01-01 01:00:00,Jan,Brooklyn,1519,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
2,2015-01-01 01:00:00,Jan,EWR,0,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
3,2015-01-01 01:00:00,Jan,Manhattan,5258,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
4,2015-01-01 01:00:00,Jan,Queens,405,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29096,2015-06-30 23:00:00,Jun,EWR,0,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
29097,2015-06-30 23:00:00,Jun,Manhattan,3828,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
29098,2015-06-30 23:00:00,Jun,Queens,580,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
29099,2015-06-30 23:00:00,Jun,Staten Island,0,N,7.0,10.0,75.0,65.0,1011.8,0.0,0.0,0.0,0.0
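
The same renaming can also be done without a helper function by using the string accessor on the column index. This is a sketch on an empty toy frame that borrows a few of the column names listed earlier; `sample` is a hypothetical name.

```python
import pandas as pd

# Toy frame with a subset of the original column names (no rows needed).
sample = pd.DataFrame(columns=['pickup_dt', 'pcp 01', 'pcp 06', 'pcp 24', 'sd'])

# Replace every space with an underscore in all column names at once.
# regex=False treats ' ' as a literal string rather than a pattern.
sample.columns = sample.columns.str.replace(' ', '_', regex=False)

print(list(sample.columns))
```

Passing a function to `rename`, as in the cell above, is handy when the transformation is more involved; the `.str` accessor is shorter for simple substitutions.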


**Counting the number of records for the Manhattan borough**

In [10]:
taxi.query('borough == "Manhattan"').shape[0]

4343

**Counting the number of records for the Brooklyn borough**

In [11]:
taxi.query('borough == "Brooklyn"') \
    .shape[0]

4343
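
Instead of querying one borough at a time, `value_counts` returns the record count for every borough in a single call. The sketch below uses only the borough values from the nine rows displayed in the table above, so the counts are for this toy sample and not for the full dataset.

```python
import pandas as pd

# Toy sample: borough values from the rows displayed earlier in the notebook.
sample = pd.DataFrame({
    'borough': ['Bronx', 'Brooklyn', 'EWR', 'Manhattan', 'Queens',
                'EWR', 'Manhattan', 'Queens', 'Staten Island'],
})

# Number of records per borough, sorted from most to least frequent.
counts = sample['borough'].value_counts()

print(counts)
```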

**Count the total number of rides (without grouping)**

In [12]:
taxi['pickups'].sum()

14265773

**Group the data by borough and find the borough with the largest number of completed rides**

In [13]:
taxi.groupby('borough').agg({'pickups': 'sum'}).sort_values('pickups', ascending=False)

borough,pickups
Manhattan,10367841
Brooklyn,2321035
Queens,1343528
Bronx,220047
Staten Island,6957
EWR,105


In [14]:
# Alternative: return only the borough with the largest total
taxi.groupby('borough').agg({'pickups': 'sum'}).idxmax()

pickups    Manhattan
dtype: object
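
The six borough totals above add up to 14,259,513, which is 6,260 fewer than the ungrouped total of 14,265,773. One possible explanation, which this notebook does not verify, is that some rows have a missing `borough` value: `groupby` drops `NaN` keys by default. The sketch below reproduces that behaviour on a toy frame; the row with the missing borough and its pickup count are made up for illustration, and the `dropna` argument requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Toy sample: one row has a missing borough (made-up value of 10 pickups).
sample = pd.DataFrame({
    'borough': ['Manhattan', 'Queens', np.nan, 'Manhattan'],
    'pickups': [5258, 405, 10, 3828],
})

# Default behaviour: rows whose group key is NaN are silently excluded.
default_total = sample.groupby('borough')['pickups'].sum().sum()

# dropna=False keeps NaN as its own group, so nothing is lost.
with_nan_total = sample.groupby('borough', dropna=False)['pickups'].sum().sum()

# Count how many rows have no borough at all.
missing_rows = sample['borough'].isna().sum()

print(default_total, with_nan_total, sample['pickups'].sum(), missing_rows)
```

Running `taxi['borough'].isna().sum()` on the real data would confirm or rule out this explanation.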

**Find the borough with the fewest completed trips**

In [15]:
min_pickups = taxi.groupby('borough').agg({'pickups': 'sum'}).idxmin()

print(min_pickups)

pickups    EWR
dtype: object


**Group the data by the borough and hday columns. Compare the mean number of rides. Find the boroughs that have more rides on holidays than on workdays.**

In [16]:
gr = taxi.groupby(['borough', 'hday'], as_index = False).agg({'pickups' : 'mean'})
gr.pivot(columns='hday', values='pickups', index='borough').query('Y > N')

borough,N,Y
EWR,0.023467,0.041916
Queens,308.899904,320.730539


**Count the number of rides per month for each borough. Sort the result in descending order.**

In [17]:
pickups_by_mon_bor = taxi.groupby(['borough', 'pickup_month'], as_index=False) \
                        .agg({'pickups': 'sum'}) \
                        .sort_values('pickups', ascending=False)

pickups_by_mon_bor

,borough,pickup_month,pickups
21,Manhattan,Jun,1995388
23,Manhattan,May,1888800
19,Manhattan,Feb,1718571
22,Manhattan,Mar,1661261
18,Manhattan,Apr,1648278
20,Manhattan,Jan,1455543
9,Brooklyn,Jun,482466
11,Brooklyn,May,476087
6,Brooklyn,Apr,378095
10,Brooklyn,Mar,346726


**Create a function to convert Fahrenheit to Celsius. Convert the data in the temp column and assign the Celsius values to a new column, 'temp_C'.**

In [18]:
# a function to convert Fahrenheit to Celsius
def temp_to_celcius(temp_f):
    return ((temp_f - 32) * 5) / 9


temp_to_celcius(451)

232.77777777777777

In [19]:
# Checking temperature data in Fahrenheit before conversion
taxi['temp'][:5]

0    30.0
1    30.0
2    30.0
3    30.0
4    30.0
Name: temp, dtype: float64

In [20]:
# Conversion to Celsius and checking temperature data after conversion.
taxi['temp_C'] = temp_to_celcius(taxi['temp']).round(1)
taxi['temp_C'][:5]

0   -1.1
1   -1.1
2   -1.1
3   -1.1
4   -1.1
Name: temp_C, dtype: float64
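
A quick way to sanity-check the conversion is to run it on temperatures whose Celsius values are known: water freezes at 32 °F (0 °C) and boils at 212 °F (100 °C). The sketch below redefines the formula under the hypothetical name `fahrenheit_to_celsius` so that it runs on its own, and also includes the two `temp` values visible in the table above (30.0 and 75.0).

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert Fahrenheit to Celsius; works on scalars and on pandas Series."""
    return (temp_f - 32) * 5 / 9

# Two reference points plus two temperatures from the dataset preview.
temps_f = pd.Series([32.0, 212.0, 30.0, 75.0])

# The arithmetic is applied element-wise, so no explicit loop is needed.
temps_c = fahrenheit_to_celsius(temps_f).round(1)

print(temps_c.tolist())
```

The third value matches the -1.1 shown in the `temp_C` output above.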