# US Bike Rides Analysis Near Bay Area

# Introduction

Bay Area Bike Share is a company that provides on-demand bike rentals for customers in San Francisco, Redwood City, Palo Alto, Mountain View, and San Jose. Users can unlock bikes from a variety of stations throughout each city, and return them to any station within the same city. Users pay for the service either through a yearly subscription or by purchasing 3-day or 24-hour passes. Users can make an unlimited number of trips, with trips under thirty minutes in length having no additional charge; longer trips will incur overtime fees.

In this project, I will put myself in the shoes of a data analyst performing an exploratory analysis on the data. Here I will take a look at two of the major parts of the data analysis process: data wrangling and exploratory data analysis. Here I will be working on finding out few insights from the data which will help in **making smarter business decisions? Or you might think about if you were a user of the bike share service. What factors might influence how you would want to use the service?**

# Data Wrangling

First, I will be loading all of the packages and functions that i will be using in my analysis by running the first code cell below.

In [1]:
"""This version is with time filter and the city filter."""
import csv
import pandas as pd
import time
import datetime
import collections
from collections import Counter

In [2]:
samp_newyork=pd.read_csv('/Users/pra/Documents/new_york_city.csv')
samp_chicago=pd.read_csv('/Users/pra/Documents/chicago.csv')
samp_washington=pd.read_csv('/Users/pra/Documents/washington.csv')

In [3]:
print ('NAN in newyork')
samp_newyork.isnull().sum()

NAN in newyork


Unnamed: 0           0
Start Time           0
End Time             0
Trip Duration        0
Start Station        0
End Station          0
User Type          692
Gender           29209
Birth Year       28220
dtype: int64

In [4]:
print ('NAN in chicago ')
samp_chicago.isnull().sum()

NAN in chicago 


Unnamed: 0           0
Start Time           0
End Time             0
Trip Duration        0
Start Station        0
End Station          0
User Type            0
Gender           61052
Birth Year       61019
dtype: int64

In [5]:
print ('NAN in washington')
samp_washington.isnull().sum()

NAN in washington


Unnamed: 0       0
Start Time       0
End Time         0
Trip Duration    0
Start Station    0
End Station      0
User Type        0
dtype: int64

Here we can see that there was only missing values in the user type, gender, birth year and the remaining
all others have no missing values so we can carry on the analysis without removing any rows till we
consider these columns with missing values.

**Here I am implementing a function that helps in selecting the area which you want to 
concentrate you can also work on all the areas at once if wanted.**

In [7]:
def get_city():
    '''Asks the user for a city and returns the filename for that city's bike share data.

    Args:
        none.
    Returns:
        (str) Filename for a city's bikeshare data.
    '''
    cityname = input('\nHello! Let\'s explore some US bikeshare data!\n'
                 'Would you like to see data for Chicago, New York, or Washington?\n')
    cityname=cityname.lower()
    city_verify=cityname
    if (cityname=='newyork'):
      filename='/Users/pra/Documents/new_york_city.csv'
      city_df=pd.read_csv(filename)
      return city_df,city_verify
    elif (cityname=='washington'):
      filename='/Users/pra/Documents/washington.csv'
      city_df=pd.read_csv(filename)
      return city_df,city_verify
    elif (cityname=='chicago'):
      filename='/Users/pra/Documents/chicago.csv'
      city_df=pd.read_csv(filename)
      return city_df,city_verify
    else:
      print ("Enter only the above given names only")
      exit()

Here I am implementing a function for finding the month with more number of rides so that we can know where we can add more rides and when to reduce so that there will be more utilization of the servie.

In [17]:
def popular_month(city_file):
    '''This function calculates the month that occurs most often in the start time'''
    city_file["Start Time"] = pd.to_datetime(city_file["Start Time"])
    months= city_file["Start Time"].dt.month
    countmonths= Counter(months)
    max_month=countmonths.most_common()[0:1]
    for x in max_month:
      if (x[0]==1):
        y= 'January'
      elif (x[0]==2):
        y='February'
      elif (x[0]==3):
        y='March'
      elif (x[0]==4):
        y='April'
      elif (x[0]==5):
        y='May'
      elif (x[0]==6):
        y='June'
      elif (x[0]==7):
        y='July'
      elif (x[0]==8):
        y='August'
      elif (x[0]==9):
        y='September'
      elif (x[0]==10):
        y='October'
      elif (x[0]==11):
        y='November'
      elif (x[0]==12):
        y='December'
      print ("Month often seen in start time is {} month and count is {}".format (y,x[1]))

Here I am writing a function which find out the which day of the month, hour of the day more numer of rides are being used like first weekend second sundays etc

In [8]:
def popular_day(city_file):
    '''This function calculates the day that occurs most often in the start time
    It takes city dataframe as the argument'''
    city_file["Start Time"] = pd.to_datetime(city_file["Start Time"])
    days= city_file["Start Time"].dt.weekday
    countdays= Counter(days)
    max_day=countdays.most_common()[0:1]
    for x in max_day:
      if (x[0]==0):
        y= 'Sunday'
      elif (x[0]==1):
        y='Monday'
      elif (x[0]==2):
        y='Tuesday'
      elif (x[0]==3):
        y='Wednesday'
      elif (x[0]==4):
        y='Thursday'
      elif (x[0]==5):
        y='Friday'
      elif (x[0]==6):
        y='Saturday'
      print ("Day often seen in start time is {}  weekday and count is {}".format (y,x[1]))


In [9]:
def popular_hour(city_file):
    '''This function calculates the hour that occurs most often in the start time
     It takes city dataframe as the argument'''
    city_file["Start Time"] = pd.to_datetime(city_file["Start Time"])
    hours= city_file["Start Time"].dt.hour
    counthours= Counter(hours)
    max_hour=counthours.most_common()[0:1]
    for x in max_hour:
      print ("Day often seen in start time is {} th hour and count is {}".format (x[0],x[1]))

Here we are implementing the function for finding out the station with more number of rides and the 
trip having maximum rides.

In [10]:
def popular_stations(city_file):
    '''This function calculates the popular start station and the end station
    It takes city dataframe as the argument'''
    start_station=city_file.groupby(["Start Station"]).size()
    end_station=city_file.groupby(["End Station"]).size()
    print ("The most comon start staion is {} and its count is{} ".format(start_station.idxmax(),start_station.max()))
    print ("The most common end station is {} and its count is {}".format(end_station.idxmin(),end_station.max()))


In [11]:
def popular_trip(city_file):
    '''This function calculates the popular trip between start station and the end station'''
    trip=city_file.groupby(["Start Station", "End Station"]).size()
    print ("The most common trip is between start station {} and end station {} and the count is {}".format(trip.idxmax()[0],trip.idxmax()[1],trip.max()))


Here we are finding out the types of users, genders so that they can provide incentives to the group having less count for increasing their business.

In [12]:
def users(city_file):
    '''This function gives the  are the counts of each user type
     It takes city dataframe as the argument'''
    user=city_file['User Type'].tolist()
    coustomer=0
    subscriber=0
    for i in range(len(user)):
      if (user[i]=='Customer'):
        coustomer+=1
      elif (user[i]=='Subscriber'):
        subscriber+=1
    print ("Total number of coustomer users are {}".format(coustomer))
    print ("Total number of subscriber users are {}".format(subscriber))


In [13]:
def genders(city_file):
    '''This function gives the counts of each gender type
    It takes city dataframe as the argument'''
    gender=city_file['Gender'].tolist()
    male=0
    female=0
    for i in range(len(gender)):
      if (gender[i]=='Male'):
        male+=1
      elif (gender[i]=='Female'):
        female+=1
    print ("Total number of male users {}".format(male))
    print ("Total number of femlae users {}".format(female))

Here we are implementing a function where we can find the trip duration and also the age groups who are more and less interested towards these rides.

In [14]:
def trip_duration(city_file):
    '''This function gives the total and average duration of all the trips
    It takes city dataframe as the argument'''
    time_sec=city_file['Trip Duration'].sum()
    ave_time_sec=city_file['Trip Duration'].mean()
    print ("Total  time for trip durations in minutes is {} and in hours is {} and seconds is {}".format((time_sec)/60,(time_sec)/(60*60),time_sec))
    print ("Average time for trip durations in minutes is {} and in hours is {} and seconds is {}".format((ave_time_sec)/60,(ave_time_sec)/(60*60),ave_time_sec))


In [15]:
def birth_years(city_file):
    ''' This function gives the earliest birth year (when the oldest person was born),
    most recent birth year, and most common birth year?
    '''
    age=city_file['Birth Year']
    common_age=city_file.groupby(["Birth Year"]).size()
    print ("The earliest birth year was {}".format(int(age.max())))
    print ("The recent birth year was {}".format(int(age.min())))
    print ("The most comon birth year is {} and its count is{} ".format(common_age.idxmax(),common_age.max()))



This following function includes all the functions described above and finding out our insights.

In [None]:
def statistics():
    '''Calculates and prints out the descriptive statistics about a city and time period
    specified by the user via raw input.

    Args:
        nonegiven    Returns:
        nonegiven    '''
    # Filter by city (Chicago, New York, Washington)7
    city,city_verify = get_city()
    city["Start Time"] = pd.to_datetime(city["Start Time"])
    time_period=input("do you want any filters to be applied that is month day or both or none \n")
    if (time_period=='month'):
        month_given=input("enter month as number for example January=1\n")
        if (1<=int(month_given)<=12):
          city=city.loc[(city['Start Time'].dt.month==int(month_given))]
        else:
          print ("Enter within the range")
          exit()
    elif (time_period=='day'):
      day_given=input("enter day as number for example Sunday=1\n")
      if (0<=int(day_given)<=6):
        city=city.loc[(city['Start Time'].dt.weekday==int(day_given))]
      else:
        print ("Enter with in range")
        exit()
    elif (time_period=='both'):
      month_given=input("Enter the month as number for example January=1 \n")
      day_given=input("enter day \n")
      if (bool(0<=int(day_given)<=6) & bool(1<=int(month_given)<=12)):
        city=city.loc[(city['Start Time'].dt.weekday==int(day_given)) & (city['Start Time'].dt.month==int(month_given))]
      else:
        print ("Enter with  in the range of days")
        exit()
    elif (time_period=='none'):
    	city=city
    else:
      print ("Enter a valid option")
      statistics()
        
    start_time = time.time()
    popular_month(city)

        #TODO: call popular_month function and print the results
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is popular day...")
    start_time = time.time()
    popular_day(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is popular hour...")
    popular_hour(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is popular stations both start and end...")
    popular_stations(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is poplar trip between start and end station...")
    popular_trip(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is count of users...")
    users(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is trip duration...")
    trip_duration(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is genders count...")
    if (city_verify=='washington'):
      print ("The column was not there for the current dataset")
      restart = input('\nWould you like to restart? Type \'yes\' or \'no\'.\n')
      if restart.lower() == 'yes':
        statistics()
      else:
        print ("Exit Application done")
        exit()
    else:
      genders(city)
    print("That took %s seconds." % (time.time() - start_time))
    print("Calculating the next statistic that is recent and old birth years...")
    birth_years(city)
    print("That took %s seconds." % (time.time() - start_time))
    restart = input('\nWould you like to restart? Type \'yes\' or \'no\'.\n')
    if restart.lower() == 'yes':
      statistics()
    else:
        print ("Exit Application done")
statistics()



Hello! Let's explore some US bikeshare data!
Would you like to see data for Chicago, New York, or Washington?
NewyoRk
do you want any filters to be applied that is month day or both or none 
month
enter month as number for example January=1
5
Month often seen in start time is May month and count is 67015
That took 0.07295441627502441 seconds.
Calculating the next statistic that is popular day...
Day often seen in start time is Tuesday  weekday and count is 13422
That took 0.01601123809814453 seconds.
Calculating the next statistic that is popular hour...
Day often seen in start time is 17 th hour and count is 6801
That took 0.03202247619628906 seconds.
Calculating the next statistic that is popular stations both start and end...
The most comon start staion is Pershing Square North and its count is649 
The most common end station is Bike The Branches - Central Branch and its count is 669
That took 0.055040836334228516 seconds.
Calculating the next statistic that is poplar trip between 