# **Uber Data Analysis using Python**

This notebook contains an analysis of Uber pickups in New York City from April 2014 to September 2014.

## **1) Import the libraries**
All the Python libraries used in this analysis are listed below.

In [None]:
#import the libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

## **2) Data Loading and Preparation**

For this analysis, we focus on Uber rides throughout New York City from April through September 2014. The raw data comes in six files, one per month. Each file has the following attributes:

* `Date/Time`: Date and Time of the Uber Ride
* `Lat`: Latitude of the Ride location
* `Lon`: Longitude of the Ride location
* `Base`: The TLC base company code affiliated with the Uber Ride

We import each file into its own dataframe and then concatenate them into one.

In [None]:
#import dataset 
data_apr = pd.read_csv("../input/uber-pickups-in-new-york-city/uber-raw-data-apr14.csv")
data_may = pd.read_csv("../input/uber-pickups-in-new-york-city/uber-raw-data-may14.csv")
data_jun = pd.read_csv("../input/uber-pickups-in-new-york-city/uber-raw-data-jun14.csv")
data_jul = pd.read_csv("../input/uber-pickups-in-new-york-city/uber-raw-data-jul14.csv")
data_aug = pd.read_csv("../input/uber-pickups-in-new-york-city/uber-raw-data-aug14.csv")
data_sep = pd.read_csv("../input/uber-pickups-in-new-york-city/uber-raw-data-sep14.csv")

#combining everything into one dataframe
#ignore_index=True gives the combined dataframe a fresh 0..n-1 index
frames = [data_apr, data_may, data_jun, data_jul, data_aug, data_sep]
data_2014 = pd.concat(frames, ignore_index=True)

data_2014.head()
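
The six files share one layout, so the load-and-combine step can also be written as a loop. The sketch below is only an illustration: `apr_csv` and `may_csv` are hypothetical in-memory stand-ins for two of the monthly files, with made-up sample rows. It shows that `ignore_index=True` gives the combined dataframe one continuous index instead of repeating each file's own row numbers.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for two of the monthly CSV files.
apr_csv = io.StringIO(
    '"Date/Time","Lat","Lon","Base"\n'
    '"4/1/2014 0:11:00",40.7690,-73.9549,"B02512"\n'
    '"4/1/2014 0:17:00",40.7267,-74.0345,"B02512"\n'
)
may_csv = io.StringIO(
    '"Date/Time","Lat","Lon","Base"\n'
    '"5/1/2014 0:02:00",40.7521,-73.9914,"B02512"\n'
)

# Read every source into its own dataframe, then stack them.
frames = [pd.read_csv(source) for source in (apr_csv, may_csv)]
combined = pd.concat(frames, ignore_index=True)

print(combined.index.tolist())
```

With the real data, the same loop would iterate over the six file paths instead of the in-memory buffers.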

As we can see from the dataset columns, several other columns can be derived from the `Date/Time` column. We will create `Month`, `Day`, `Weekday` and `Hour` columns; the intermediate `Time` column and the original `Date/Time` column are dropped afterwards.

In [None]:
#splitting the timestamp columns in the dataframe
data_2014['Date/Time'] = pd.to_datetime(data_2014['Date/Time'], format="%m/%d/%Y %H:%M:%S")

data_2014['Month'] = data_2014['Date/Time'].dt.month
data_2014['Day'] = data_2014['Date/Time'].dt.day
data_2014['Time'] = data_2014['Date/Time'].dt.time

data_2014.head()

#Mapping the date time values to string equivalent 
# Day of the week
data_2014['Weekday'] = data_2014['Date/Time'].dt.dayofweek

# Hour
data_2014['Hour'] = data_2014['Date/Time'].dt.hour

# Map month
month = {
    4: 'April',
    5: 'May',
    6: 'June', 
    7: 'July',
    8: 'August',
    9: 'September'
}
data_2014['Month'] = data_2014['Month'].replace(month)

# Map weekday
weekday = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday', 
    4: 'Friday', 
    5: 'Saturday',
    6: 'Sunday'
}
data_2014['Weekday'] = data_2014['Weekday'].replace(weekday)

#removing the redundant columns
data_2014.drop(columns = ['Date/Time', 'Time'], inplace = True)
data_2014.head()
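
As an alternative to the two mapping dictionaries, pandas can return month and weekday names directly through the `.dt` accessor. This is a minimal sketch on a few hypothetical sample timestamps (`raw` is not taken from the dataset), written in the same format as the `Date/Time` column.

```python
import pandas as pd

# Hypothetical sample timestamps in the same format as the raw data.
raw = pd.Series(["4/1/2014 0:11:00", "4/30/2014 17:05:00", "9/1/2014 8:30:00"])
stamps = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")

# Derive the same four features, with names instead of integer codes.
features = pd.DataFrame({
    "Month": stamps.dt.month_name(),
    "Day": stamps.dt.day,
    "Weekday": stamps.dt.day_name(),
    "Hour": stamps.dt.hour,
})

print(features)
```

The dictionary approach used above is still handy later on, because `weekday.values()` and `month.values()` also define the display order of the bars.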

The transformed dataframe has 4534327 rows/instances and 7 columns/attributes in total.

In [None]:
data_2014.shape

## **3) Exploratory Analysis and Visualization**

### **Checking for missing values**
This first step is standard for most analyses. We check whether the dataset contains any missing values and take corrective measures if it does.

In [None]:
#check for missing values
data_2014.isnull().sum()

Looks like there are no missing values in our dataset. Great!

### **Checking for Duplicate values**

Let's see if there are any duplicate rows in our dataframe.

In [None]:
#checking for duplicate values
duplicates= data_2014[data_2014.duplicated(keep=False)]
duplicates


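Duplicate rows do exist in the dataset. Note that this check runs after the `Date/Time` column has been dropped, so two pickups from the same base at the same coordinates within the same hour of the same day count as duplicates. Since we do not know how precise the recorded latitude/longitude and time values are, these rides may simply have happened at around the same time and location. Therefore, for the sake of this analysis, we will assume that duplicate pickups are valid.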
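
To put a number on the duplicates rather than only listing them, the boolean mask returned by `duplicated` can be summed. The sketch below uses a small hypothetical dataframe (`sample`, made-up values) to show the difference between `keep=False`, which flags every member of a duplicate group, and the default, which flags only the copies after the first occurrence.

```python
import pandas as pd

# Hypothetical rows: the first two are identical in every column.
sample = pd.DataFrame({
    "Lat": [40.7690, 40.7690, 40.7267, 40.7316],
    "Lon": [-73.9549, -73.9549, -74.0345, -73.9873],
    "Base": ["B02512", "B02512", "B02512", "B02598"],
    "Hour": [0, 0, 0, 17],
})

# Every row that belongs to a group of identical rows.
all_copies = sample.duplicated(keep=False).sum()

# Only the repeats, i.e. rows that a drop_duplicates() call would remove.
extra_copies = sample.duplicated().sum()

print(all_copies, extra_copies)
```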

In [None]:
#quick stats on the dataframe
total_rides = len(data_2014.index)
total_days = len(data_2014[['Month', 'Day']].drop_duplicates())
avg_per_day = np.round(total_rides/total_days, 0)

stats_raw = 'Total Number of Pickups: {}\nTotal Number of Days: {}\nAvg Daily Rides: {}'
print(stats_raw.format(total_rides, total_days, avg_per_day))

According to the results above, there were about 4.5 million pickups in New York City over the six-month period, an average of 24778 pickups per day.
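
The daily average can be checked by hand. This short sketch counts the calendar days from April through September 2014 with the standard library and divides the row count reported earlier by that number.

```python
import calendar

total_rides = 4534327  # row count reported earlier in this notebook

# monthrange returns (weekday of the 1st, number of days); keep the day count.
total_days = sum(calendar.monthrange(2014, m)[1] for m in range(4, 10))

avg_per_day = round(total_rides / total_days)

print(total_days, avg_per_day)
```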

In [None]:
#plotting the trips by hours in the day 
hourly = data_2014.pivot_table(index=['Hour'], values='Base', aggfunc='count')

hourly.plot(kind='bar', figsize = (12,8), colormap = "copper")
plt.xlabel('Hour')
plt.ylabel('Total Rides')
plt.title('Rides by Hour of the day')

The number of trips is highest between 16:00 and 18:00, with a peak at 17:00. This is consistent with the end of the typical working day, when employees head home.
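
The peak hour can also be read off programmatically instead of from the chart. A minimal sketch, assuming a hypothetical `hours` series in place of `data_2014['Hour']`:

```python
import pandas as pd

# Hypothetical pickup hours standing in for data_2014['Hour'].
hours = pd.Series([8, 16, 17, 17, 17, 18, 18], name="Hour")

# Count pickups per hour and order the result by hour of the day.
hourly_counts = hours.value_counts().sort_index()

# idxmax returns the index label (the hour) with the largest count.
peak_hour = hourly_counts.idxmax()

print(peak_hour)
```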

In [None]:
#plotting the trips by days of the week
#reindex orders the counts from Monday to Sunday
week_day = data_2014['Weekday'].value_counts().reindex(list(weekday.values()))

week_day.plot(kind='bar', figsize = (12,8), colormap = "copper")
plt.xlabel('Weekday')
plt.ylabel('Total Rides')
plt.title('Rides by Day of the week')

The most rides were booked on Thursday, followed by Friday. Tuesday and Wednesday also see more rides than Saturday and Sunday. However, we cannot conclude that weekdays are busier than weekends in general, because Monday's total is low.

In [None]:
#plotting the trips by day of the month
month_day = data_2014.pivot_table(index=['Day'],values='Base',aggfunc='count')
month_day.plot(kind='bar', figsize=(12,8), colormap = 'copper')
plt.xlabel('Day of the Month')
plt.ylabel('Total Rides')
plt.title('Rides by Day of the Month')

Day 31 has far fewer trips than the other days because April, June and September have only 30 days, so the 31st occurs in just three of the six months. The day with the most trips is the 30th. Otherwise, there is little variation from day to day.
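
To compare days of the month on an equal footing, each day's total could be divided by the number of months in which that day occurs. The sketch below only computes those occurrence counts for April through September 2014; applying them to the totals is left as an optional extra step and is not part of the original analysis.

```python
import calendar

# Number of days in each month from April (4) to September (9) 2014.
month_lengths = {m: calendar.monthrange(2014, m)[1] for m in range(4, 10)}

# For each day of the month, count the months that are long enough to contain it.
occurrences = {
    day: sum(1 for length in month_lengths.values() if day <= length)
    for day in range(1, 32)
}

print(occurrences[30], occurrences[31])
```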

In [None]:
#plotting the trips by months
#reindex orders the counts from April to September
monthly = data_2014['Month'].value_counts().reindex(list(month.values()))

monthly.plot(kind='bar', figsize = (12,8), colormap = "copper")
plt.xlabel('Month')
plt.ylabel('Total Pickups in Millions')
plt.title('Total Rides Per Month')

According to the plot above, the number of Uber pickups increased from April to September 2014.

In [None]:
#plotting the trips by TLC base
base = data_2014['Base'].value_counts()

base.plot(kind='bar', figsize = (12,8), colormap = "copper")
plt.xlabel('Base')
plt.ylabel('Total Rides in Millions')
plt.title('Rides by Base Location')

The pickups are clearly imbalanced across bases: nearly 90% of all pickups come from three of the five bases.
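
A share like this can be computed directly with `value_counts(normalize=True)`. The sketch below uses hypothetical base codes (`X1` to `X4`) and made-up counts, not the real TLC bases, purely to show the calculation.

```python
import pandas as pd

# Hypothetical base codes and counts, not the real TLC data.
bases = pd.Series(["X1"] * 4 + ["X2"] * 3 + ["X3"] * 2 + ["X4"] * 1)

# Fraction of pickups per base, sorted from largest to smallest.
share = bases.value_counts(normalize=True)

# Combined share of the three largest bases.
top3_share = share.head(3).sum()

print(round(top3_share, 2))
```

On the real data, `data_2014['Base']` would take the place of `bases`.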

Next, we cross-analyze the time features with three heatmaps:

* Heatmap by Hour and Day.
* Heatmap by Hour and Weekday.
* Heatmap by Month and Day.

In [None]:
#cross analysis
#Defining a function that counts the number of rows
def rows_count(rows):
    return len(rows)
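
The same per-group row counts can be obtained without a helper function through the built-in `size` method of a groupby, which is usually faster than `apply` on large dataframes. A minimal sketch on a hypothetical `sample` dataframe:

```python
import pandas as pd

# Hypothetical rows standing in for data_2014.
sample = pd.DataFrame({
    "Hour": [0, 0, 17, 17, 17],
    "Day": [1, 2, 1, 1, 2],
    "Base": ["B02512"] * 5,
})

# size() counts the rows in each (Hour, Day) group;
# unstack() moves Day into the columns, giving an Hour x Day grid.
hour_day_counts = sample.groupby(["Hour", "Day"]).size().unstack()

print(hour_day_counts)
```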

In [None]:
#heatmap by hour and day
hour_day = data_2014.groupby('Hour Day'.split()).apply(rows_count).unstack()

plt.figure(figsize = (12,8))

#Using the seaborn heatmap function 
ax = sns.heatmap(hour_day, cmap="Greys", linewidth = .5)
ax.set(title="Rides by Hour and Day")

The number of trips increases throughout the day, with peak demand in the evening between 16:00 and 18:00. This corresponds to the time when employees finish work and go home.

In [None]:
#heatmap by hour and weekday
hour_weekday = data_2014.groupby('Hour Weekday'.split(), sort = False).apply(rows_count).unstack()

plt.figure(figsize = (12,8))

ax = sns.heatmap(hour_weekday, cmap="Greys", linewidth = .5)
ax.set(title="Rides by Hour and Weekday")

On working days (Monday to Friday), the number of trips is higher from 16:00 to 21:00, which makes the evening peak seen in the first heatmap even clearer.

On Friday, the number of trips remains high until 23:00 and continues into early Saturday. This corresponds to people leaving work and then going out for dinner or drinks before the weekend. The same pattern appears on Saturday: people tend to go out at night, and the number of trips remains high until early Sunday.
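
Reading a pattern like this is easier when the weekday columns run from Monday to Sunday. One way to guarantee that order is to build the grid with `pd.crosstab` and then reindex the columns explicitly. The sketch below assumes a small hypothetical `sample` dataframe in place of `data_2014`.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical rows standing in for data_2014.
sample = pd.DataFrame({
    "Hour": [17, 17, 23, 23, 8],
    "Weekday": ["Friday", "Monday", "Saturday", "Friday", "Monday"],
})

# crosstab counts rows per (Hour, Weekday); reindex fixes the column order
# and fills weekdays that are absent from the sample with 0.
hour_weekday = pd.crosstab(sample["Hour"], sample["Weekday"]).reindex(
    columns=weekday_order, fill_value=0
)

print(hour_weekday)
```

The resulting grid can be passed to `sns.heatmap` in the same way as before.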

In [None]:
#heatmap by day and Month
day_month = data_2014.groupby('Day Month'.split(), sort = False).apply(rows_count).unstack()

plt.figure(figsize = (12,8))

ax = sns.heatmap(day_month, cmap = "Greys", linewidth = .5)
ax.set(title="Rides by Day and Month")

The number of trips increases each month, so Uber's pickup volume grew steadily from April to September 2014.

The heatmap also shows one dark spot, which corresponds to 30 April. The number of trips that day was far higher than on any other day of that month.
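
An outlier like this can be located without reading the heatmap by taking `idxmax` of the grouped counts. A minimal sketch on hypothetical rows (the values in `sample` are made up):

```python
import pandas as pd

# Hypothetical rows standing in for data_2014.
sample = pd.DataFrame({
    "Month": ["April", "April", "April", "May", "May"],
    "Day": [29, 30, 30, 1, 30],
})

# Rows per (Month, Day) pair.
counts = sample.groupby(["Month", "Day"]).size()

# idxmax returns the (Month, Day) label of the largest count.
busiest = counts.idxmax()

print(busiest, counts.max())
```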

In [None]:
#plotting every pickup location as a small black marker
plt.figure(figsize=(12,8))
plt.plot(data_2014['Lon'],data_2014['Lat'], 'k+', ms=0.5)
plt.xlim(-74.2,-73.7)
plt.ylim(40.6,41)

Plotting the pickup locations produces a clear outline of the different areas of New York City. In the center of the plot, we see Manhattan, home to many popular sites such as Times Square, Broadway and Central Park. On the lower right side of the plot, a dense cluster of markers shows where JFK airport is located. The fact that pickups are concentrated in these areas is not surprising, but it is nonetheless interesting to see.
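
With millions of overlapping markers, a scatter plot shows where pickups happen but not how many occur in each place. One possible complement is to bin the coordinates into a grid with `np.histogram2d`. The sketch below uses four hypothetical coordinates and a coarse grid over the same axis limits as the plot above; the bin counts are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical pickup coordinates standing in for data_2014['Lon'] and ['Lat'].
lon = np.array([-73.98, -73.98, -73.99, -73.78])
lat = np.array([40.75, 40.75, 40.76, 40.64])

# Count pickups per grid cell: 5 longitude bins x 4 latitude bins.
density, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=(5, 4), range=[[-74.2, -73.7], [40.6, 41.0]]
)

# Grid position (longitude bin, latitude bin) of the busiest cell.
densest = np.unravel_index(density.argmax(), density.shape)

print(densest, density[densest])
```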

## **4) Conclusion**

Our analysis of the 2014 Uber pickups in New York City dataset gave us the following insights:

* The peak demand hour is 17:00.
* Pickup volume grew from April to September, an indicator of Uber's growth over the period.
* On working days, demand is highest in the late afternoon and evening, when people leave work.
* People tend to use Uber late at night during weekends.
* Pickups are imbalanced between the TLC bases.
* Pickups are concentrated in Manhattan and the surrounding areas.

