In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
from tensorflow import keras

This script was used to tackle a simulation of technical problems that could be seen in a job interview process.<br><br>
The topic of this simulation was taxi drivers and riders, mainly covering data wrangling, data manipulation, exploratory data analysis (EDA), machine learning (specifically, random forest and neural network models were used), and model analysis.<br><br>
To run this script, the files logins.json and ultimate_data_challenge.json must be in the same folder as this script. Then, simply run each cell in order.<br><br>
Note: Computation times may be long.

In [None]:
# Loading logins.json as a dictionary:
with open("logins.json", "r") as json_reader:
    logins_dict = json.load(json_reader)

# Loading ultimate_data_challenge.json as a list of dictionaries:
with open("ultimate_data_challenge.json", "r") as json_reader:
    data_dict = json.load(json_reader)

In [None]:
logins_dict

In [None]:
print(logins_dict.keys())

In [None]:
print(len(logins_dict['login_time']))
print(type(logins_dict['login_time'][0]))
print(logins_dict['login_time'][-1])

In [None]:
data_dict

In [None]:
print(type(data_dict))
print(data_dict[0].keys())
print(len(data_dict))

Part 1 ‑ Exploratory data analysis:

"The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15 minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them."


Looking at the loaded logins.json file, the data was loaded as a dictionary with only one key (login_time), whose value is a list of 93142 elements. The values are dates that were loaded as strings.

To answer the problem, I want to visualize the demand in each 15-minute interval by weekday. To do this, the following procedure will be performed:
- Extract the single list from the dictionary
- Convert the values into a datetime data type
- Count the number of logins (demand) per 15-minute interval
- Assign a weekday to each 15-minute interval
- Graph the demand:
    - Daily
    - Weekly
    - Monthly
    
Two things to note about the data in logins.json and how the script was written:
- The dates range from 1-1-1970 to 4-13-1970 (checked using the max() function)
- The 15-minute intervals start at midnight of 1-1-1970 (at 0 hours: 0 minutes: 0 seconds)
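
The procedure above can also be sketched with pandas resampling. This is only a minimal illustration on a handful of made-up timestamps (not the contents of logins.json); the names `sample_logins`, `counts_15min`, and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of login timestamps (strings, in the same format as logins.json)
sample_logins = [
    "1970-01-01 20:12:16",
    "1970-01-01 20:13:18",
    "1970-01-01 20:16:10",
    "1970-01-01 20:46:37",
]

# Convert the strings to datetimes and use them as the index of a Series of ones
login_times = pd.to_datetime(sample_logins, format="%Y-%m-%d %H:%M:%S")
ones = pd.Series(1, index=pd.DatetimeIndex(login_times))

# Sum the ones within each 15-minute bin; each bin is labelled by its start time
counts_15min = ones.resample("15min").sum()

# Attach the weekday name of each interval start
summary = counts_15min.to_frame("COUNT")
summary["WEEKDAY"] = summary.index.day_name()
print(summary)
```

Empty intervals between the first and last login appear with a count of 0, so the resulting series has no gaps.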

In [None]:
# Extracting the single value of the dictionary, logins_dict:
logins_list = logins_dict['login_time']

In [None]:
# Converting the values into datetime objects:
logins_list_converted = [datetime.strptime(date, '%Y-%m-%d %H:%M:%S') for date in logins_list]

# Checking the last/latest date in the list:
print(max(logins_list_converted))

In [None]:
# Creating a new DataFrame to count the number of timestamps within 15 minute intervals:
start_datetime = datetime(year= 1970, month= 1, day= 1)

## The end_datetime was determined by rounding up the last data entry to the next 15 minute interval (1970-04-13 18:57:38):
end_datetime = datetime(year= 1970, month= 4, day= 13, hour= 19)

time_interval = timedelta(minutes= 15)

changing_datetime = start_datetime

interval_list = []

while changing_datetime >= start_datetime and changing_datetime < end_datetime:
    interval_list.append(changing_datetime)
    changing_datetime = changing_datetime + time_interval

login_df_interval = pd.DataFrame({'TIMESTAMP_START_INTERVAL': interval_list, 
                                 'COUNT': np.nan, 
                                 'WEEK': np.nan, 
                                 'WEEKDAY': np.nan, 
                                 'MONTH': np.nan})

for index_interval, date_interval in login_df_interval.iterrows():
    start_time = date_interval.TIMESTAMP_START_INTERVAL
    end_time = date_interval.TIMESTAMP_START_INTERVAL + time_interval
    
    counter = 0
    
    ## Each interval covers its start time up to (but not including) the start of the next interval:
    for time_logins in logins_list_converted:
        if time_logins >= start_time and time_logins < end_time:
            counter += 1
    
    login_df_interval.loc[index_interval, 'COUNT'] = counter
    
    ## Adding week number, weekday, and month to the DataFrame:
    date_tuple = tuple()
    week_number = np.nan
    weekday_int = np.nan
    weekday_str = str()
    month = np.nan
    
    ### .isocalendar() returns a tuple, (year, week number, weekday), starting at 1/"iso" [not 0]:
    date_tuple = date_interval.TIMESTAMP_START_INTERVAL.isocalendar()
    
    week_number = date_tuple[1]
    
    weekday_int = date_tuple[2]
    
    month = date_interval.TIMESTAMP_START_INTERVAL.date().month
    
    ### Assigning string weekday to each date:
    if weekday_int == 1:
        weekday_str = 'Monday'
        
    elif weekday_int == 2:
        weekday_str = 'Tuesday'
        
    elif weekday_int == 3:
        weekday_str = 'Wednesday'
        
    elif weekday_int == 4:
        weekday_str = 'Thursday'
        
    elif weekday_int == 5:
        weekday_str = 'Friday'
        
    elif weekday_int == 6:
        weekday_str = 'Saturday'
        
    elif weekday_int == 7:
        weekday_str = 'Sunday'
        
    else:
        weekday_str = np.nan
    
    login_df_interval.loc[index_interval, 'WEEK'] = week_number
    login_df_interval.loc[index_interval, 'WEEKDAY'] = weekday_str
    login_df_interval.loc[index_interval, 'MONTH'] = month
            
    

Exploratory Data Analysis Begins: -----------------------------------------------------------------------------------------

In [None]:
################################################## Single Date #############################################################

## Graphing the date that contains the highest count in a single 15 minute interval:
max_count_date = pd.Timestamp(login_df_interval.TIMESTAMP_START_INTERVAL[login_df_interval.COUNT == max(login_df_interval.COUNT)].values[0]).date()

max_count_date_df = pd.DataFrame(columns=login_df_interval.columns)

for index, row in login_df_interval.iterrows():
    date = row.TIMESTAMP_START_INTERVAL.date()

    
    if date == max_count_date:
        temp_df = pd.DataFrame({'TIMESTAMP_START_INTERVAL': row.TIMESTAMP_START_INTERVAL, 
                        'COUNT': row.COUNT, 
                        'WEEK': row.WEEK, 
                        'WEEKDAY': row.WEEKDAY, 
                        'MONTH': row.MONTH}, index=[0])
            
        max_count_date_df = pd.concat([max_count_date_df, temp_df], ignore_index= True)

plt.figure(figsize=(20, 6))        
plt.plot(max_count_date_df.TIMESTAMP_START_INTERVAL, max_count_date_df.COUNT)
plt.xticks(rotation=90)
plt.xlabel('Time (Month-Day Hour)')
plt.ylabel('Demand Per 15 Minute Interval')
plt.title('Most Count in a Single 15 Minute Interval, 3-1-1970')
plt.show()

In [None]:
################################################ Daily Comparison ###########################################################
# Comparing by weekday, Cumulative:
weekday_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

#### This list will be used in the next cell for weekly comparison:
total_week_sum = []

for weekday in weekday_list:
    weekday_df = pd.DataFrame(columns=['TIMESTAMP_START_INTERVAL', 'COUNT','NUMBER_OF_OCCURANCES','AVERAGE'])
    temp_df = pd.DataFrame()
    temp_list = []
    
    for index, row in login_df_interval.iterrows():
        if weekday == row.WEEKDAY:
            interval = row.TIMESTAMP_START_INTERVAL.time().isoformat()
            
            if interval in list(weekday_df.TIMESTAMP_START_INTERVAL):
                weekday_df.loc[weekday_df.TIMESTAMP_START_INTERVAL == interval, 'COUNT'] = weekday_df.loc[weekday_df.TIMESTAMP_START_INTERVAL == interval].COUNT.values[0] + row.COUNT
                weekday_df.loc[weekday_df.TIMESTAMP_START_INTERVAL == interval, 'NUMBER_OF_OCCURANCES'] = weekday_df.loc[weekday_df.TIMESTAMP_START_INTERVAL == interval].NUMBER_OF_OCCURANCES.values[0] + 1  ## One more occurrence of this time slot
                
                average = weekday_df.COUNT[weekday_df.TIMESTAMP_START_INTERVAL == interval].values[0] / weekday_df.NUMBER_OF_OCCURANCES[weekday_df.TIMESTAMP_START_INTERVAL == interval].values[0]
                weekday_df.loc[weekday_df.TIMESTAMP_START_INTERVAL == interval, 'AVERAGE'] = average
            
            else:
                temp_df = pd.DataFrame({'TIMESTAMP_START_INTERVAL': interval, 
                                        'COUNT': row.COUNT,
                                        'NUMBER_OF_OCCURANCES': 1,
                                        'AVERAGE': row.COUNT}, 
                                       index=[0])

                weekday_df = pd.concat([weekday_df, temp_df], ignore_index= True)
            
            if row.WEEK != 1 and row.WEEK != 16:
                temp_list.append(row.COUNT)
    
    
    total_week_sum.append(sum(temp_list))
    
    plt.figure(figsize=(20, 6))        
    plt.plot(weekday_df.TIMESTAMP_START_INTERVAL, weekday_df.COUNT)
    plt.xticks(rotation=90)
    plt.ylim(0, 700)
    plt.xlabel('Time of Day (Hour:Minute:Second)')
    plt.ylabel('Demand Per 15 Minute Interval (Cumulative)')
    plt.title('All ' + weekday + 's')
        
plt.show()

In [None]:
################################################ Weekly Comparison ###########################################################
# Comparing each week:
week_number_range = range(1, int(max(login_df_interval.WEEK)) + 1)

week_counts = []

for week_number in week_number_range:
    temp_list = []
    
    for index, row in login_df_interval.iterrows():
        if week_number == row.WEEK:
            temp_list.append(row.COUNT)
    
    week_counts.append(sum(temp_list))
    

plt.figure(figsize=(20, 6))        
plt.plot(week_number_range, week_counts)
plt.xticks(week_number_range, rotation=90)
plt.hlines(week_counts[0]/4*7, xmin= 1, xmax= 2) ## Daily average for the first ISO week (Thursday - Sunday, 4 days), multiplied by 7 days
plt.hlines(week_counts[-1]/1*7, xmin= 15, xmax= 16) ## Daily average for the last ISO week (Monday only, a partial day), multiplied by 7 days
plt.xlabel('Week Number (For Year 1970)')
plt.ylabel('Demand (Cumulative)')
plt.title('Weeks (January-April, 1970)')
        
plt.show()

In [None]:
################################################ Monthly Comparison #########################################################
# Comparing each month:
month_number_range = range(1, int(max(login_df_interval.MONTH)) + 1)

month_counts = []

for month_number in month_number_range:
    temp_list = []
    
    for index, row in login_df_interval.iterrows():
        if month_number == row.MONTH:
            temp_list.append(row.COUNT)
    
    month_counts.append(sum(temp_list))
    
plt.figure(figsize=(20, 6))        
plt.plot(month_number_range, month_counts)
plt.xticks(month_number_range, rotation=90)
plt.hlines(month_counts[-1]/13*30, xmin= 3, xmax= 4) ## This is the daily average of April, multiplied by 30 days
plt.xlabel('Month (Numeric)')
plt.ylabel('Demand (Cumulative)')
plt.title('Demand During January-April, 1970')
        
plt.show()

After examining the patterns of demand in the various graphs generated, here is a summary of some of the patterns seen:
- Daily Graphs:
    - Monday:
		- High Demand:
			- The highest trend of the day is typically at midday, with the trend starting around 8:30 AM, peaking at 11:00 AM - 11:45 AM, and then decreasing until 1:30 PM
			- There is also a significant uptrend that begins at 7:00 PM, peaking and plateauing from 8:45 PM to 1:00 AM the next day (Tuesday)
				- This uptrend peaks and plateaus at about half of the max demand during midday 
			- Demand from Sunday night also carries over into the early morning of Monday and begins decreasing after 12:30 AM
				- This demand is also about half of the max demand of Monday midday
		- Low Demand:
			- The lowest ranges are during the morning hours, from 6:00 AM to the start of the highest trend at 8:30 AM
				- The lowest demand is between 6:15 AM - 6:30 AM, which is significantly lower than the rest of the day (it seems to almost be in the single digits of counts even though the graph shows the cumulative demand of all Mondays)
			- There is also another low range that starts after the highest trend ends at 1:30 PM and lasts until the start of the second uptrend at 7:00 PM

	- Tuesday - Thursday:
		- Tuesday - Thursday follow the same pattern as Monday, except that the demand in the night hours (including the hours from midnight to 6:00 AM) increases as the week progresses
			- On Thursday, the high demand from 10:30 PM - 11:00 PM is almost the same as the demand seen during the midday peak 
			
	- Friday:
		- Similar to Thursday, Friday has a significant increase in demand during the night hours, which peaks at about 1.5 times the midday demand.
		
	- Saturday:
		- Saturday's pattern carries over from Friday and peaks at 4:45 AM
		- A low demand range follows until 10:00 AM, when the demand begins to uptrend; it plateaus from 1:00 PM until 8:00 PM, when a new uptrend begins that continues into Sunday
		
	- Sunday:
		- Sunday is similar to Saturday's pattern, except that it has the highest demand peak of the week, between 4:30 AM - 5:00 AM
		- However, the night hours of Sunday do not uptrend; instead, the plateau that begins at midday continues into Monday 1:00 AM
		
	- Note:
		- Because these graphs are cumulative over all counts and the weekdays do not occur equally often, some days will have lower or higher values. However, since there are 15 Sundays, Mondays, Thursdays, Fridays, and Saturdays, and 14 Tuesdays and Wednesdays, there should not be a significant difference in the aggregation of counts. This can be seen in the Tuesday and Wednesday daily graphs, where the demand still increases in the night hours compared to Monday's daily graph.
		
- Weekly Graphs:
	- The demand seems to increase as the weeks progress
	- Note:
		- Because the week numbers are ISO weeks (Monday - Sunday), Week 1 only contains Thursday - Sunday and Week 16 only contains Monday (a partial day), so these two weeks are not good comparisons to the other weeks
			- Two black horizontal lines were added to extrapolate/simulate the rest of the week's demand
				- However, this extrapolation was done by simple averaging and then multiplying by 7 days, so it can significantly undershoot or overshoot
					- This could be approached using various methods (such as filling in the missing dates with random forest or filling the missing days with daily averages). However, since there isn't enough data for better weekly and monthly analysis, these methods could also undershoot or overshoot, just as depicted by the two horizontal lines.

- Monthly Graphs:
	- With the data provided, the demand seems to generally increase as the months progress
	- The horizontal line extrapolates April (Month 4) by dividing the total counts in April by 13 days and then multiplying by 30 days.

- Note:
	- For all graphs, the last data entry in the dataset is on 4-13-1970 at 18:54:23, meaning that there was not enough data to complete the entire day of 4-13-1970
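
The partial first and last weeks and the number of occurrences of each weekday can be checked directly from the calendar. The following is a minimal sketch using only the standard library, assuming the date range stated earlier (1-1-1970 to 4-13-1970); the variable names are illustrative.

```python
from collections import Counter
from datetime import date, timedelta

first_day = date(1970, 1, 1)
last_day = date(1970, 4, 13)
weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Count how many calendar days of the data range fall into each ISO week,
# and how many times each weekday occurs
days_per_week = Counter()
weekday_counts = Counter()
current = first_day
while current <= last_day:
    iso_week = current.isocalendar()[1]  # (year, week number, weekday)
    days_per_week[iso_week] += 1
    weekday_counts[weekday_names[current.weekday()]] += 1
    current += timedelta(days=1)

# Weeks with fewer than 7 days in the data range are partial weeks
partial_weeks = {week: n for week, n in days_per_week.items() if n < 7}
print(partial_weeks)
print(dict(weekday_counts))
```

The day counts per partial week are the divisors to use when scaling a partial week to a 7-day estimate.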
    
In summary:<br>
    For Monday - Friday, the demand peaks at midday and in the night hours. As the week progresses, the demand in the night hours increases to the point where, on Fridays, the night-hour peak is larger than the midday peak. The graphs for Monday - Friday resemble a "W"-shape. <br>
    For Saturday and Sunday, the demand is highest in the early morning hours (before 5 AM) and then begins a second uptrend after 1 PM. The graphs for Saturday and Sunday resemble a "U"-shape.
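
Since the weekday graphs are cumulative, one way to make weekdays with different numbers of occurrences directly comparable is to average each time slot over its occurrences. The sketch below uses synthetic counts in a hypothetical `demo_df` whose columns mirror TIMESTAMP_START_INTERVAL and COUNT in login_df_interval; it is an illustration, not a rerun of the analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for login_df_interval: two full weeks of 15-minute intervals
rng = np.random.default_rng(0)
starts = pd.date_range("1970-01-05", periods=14 * 96, freq="15min")
demo_df = pd.DataFrame({"TIMESTAMP_START_INTERVAL": starts,
                        "COUNT": rng.poisson(lam=5, size=len(starts))})

# Group by weekday name and time of day, then aggregate over the occurrences
demo_df["WEEKDAY"] = demo_df["TIMESTAMP_START_INTERVAL"].dt.day_name()
demo_df["TIME_OF_DAY"] = demo_df["TIMESTAMP_START_INTERVAL"].dt.time
profile = (demo_df.groupby(["WEEKDAY", "TIME_OF_DAY"])["COUNT"]
           .agg(["sum", "count", "mean"]))

# One column per weekday, one row per 15-minute slot
mean_profile = profile["mean"].unstack("WEEKDAY")
print(mean_profile.shape)
```

Plotting the columns of `mean_profile` instead of the cumulative sums would put all seven weekdays on the same scale.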

###################################################################################################

Part 2 ‑ Experiment and metrics design
The neighboring cities of Gotham and Metropolis have complementary circadian rhythms: on weekdays, Ultimate Gotham is most active at night, and Ultimate Metropolis is most active during the day. On weekends, there is reasonable activity in both cities.
However, a toll bridge, with a two-way toll, between the two cities causes driver partners to tend to be exclusive to each city. The Ultimate managers of city operations for the two cities have proposed an experiment to encourage driver partners to be available in both cities, by reimbursing all toll costs.
1. What would you choose as the key measure of success of this experiment in encouraging driver partners to serve both cities, and why would you choose this metric?
2. Describe a practical experiment you would design to compare the effectiveness of the proposed change in relation to the key measure of success. Please provide details on:
a. how you will implement the experiment
b. what statistical test(s) you will conduct to verify the significance of the observation
c. how you would interpret the results and provide recommendations to the city operations team along with any caveats.

Question 1:
I would choose the traffic of driver partners who use the toll bridge before and after the reimbursement experiment as the main metric. If there are other longer, less direct routes between the cities, the toll bridge could become the faster and more cost-efficient route after reimbursement, which would increase traffic on the toll bridge. If the toll bridge is the only route, then there should be an increase in driver partners who are willing to travel across the bridge, since they no longer have to factor in the tolls. If possible, the driver partners who rarely used the toll bridge before reimbursement should be examined as the main determinants of success, as they seem to be the targets of the policy.

Question 2:<br>
A) The data to record before the policy is in effect (or announced) is the number of times each driver partner passes through the toll bridge, as well as the direction of travel. Then, after a month, put the policy into effect and record the same data for another two months. For the three months, it would be better to choose months with similar consecutive traffic patterns to minimize the effect of other variables. An example is running the experiment in the summer months after June, when students are on summer vacation, instead of May-July, when students are in school at the beginning of the period and out of school at the end. If previous data is already available (at least one year), the baseline month can be skipped: simply execute the policy and record the data. Another metric to record is the amount of reimbursement paid to the driver partners and the amount of tolls collected from the toll bridge.<br><br>
B) Depending on the results and the length of the data, it would be better to ignore the first few months (if possible) after the policy is in place, as driver partners could still be adjusting to the change; the number of driver partners using the toll bridge is expected to plateau after some time has passed. The type of statistics to use depends on the results. If the traffic has dramatically increased for driver partners who were labelled as city-exclusive, then simple descriptive statistics should be sufficient. If the traffic has not largely increased, or did not increase at all, for city-exclusive driver partners, then a hypothesis test comparing each driver partner's crossings before and after the policy (for example, a paired t-test) and its p-value could be used to judge whether the change is significant. In all cases, examining the difference between the tolls paid by driver partners before the policy and the total reimbursement amount would give a complete overview of traffic by all driver partners.<br><br>
C) Depending on the goals of the city operations team, if the traffic has largely increased for city-exclusive driver partners and there is a large difference between reimbursement and tolls collected from driver partners (more reimbursement than tolls collected), then I would say that the policy does increase traffic between the cities. If the difference is small (or if a p-value is needed to determine significance), I would say that the policy might be statistically significant but does not have a large effect on the overall goal of making driver partners non-exclusive to a specific city. 
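
One concrete option for the significance test in B) is a paired test on each driver partner's bridge crossings before and after reimbursement. The sketch below uses a sign-flip permutation test on made-up counts; the data, the function name `paired_permutation_pvalue`, and the one-sided alternative are assumptions for illustration and are not part of the original analysis.

```python
import random
from statistics import mean

def paired_permutation_pvalue(before, after, n_permutations=10_000, seed=0):
    """One-sided p-value for 'mean(after - before) > 0' via random sign flips."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(after, before)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis, each difference is equally likely to be + or -
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical weekly bridge crossings for ten driver partners
crossings_before = [0, 1, 0, 2, 0, 1, 0, 0, 3, 1]
crossings_after = [2, 3, 1, 2, 1, 4, 0, 2, 5, 2]

p_value = paired_permutation_pvalue(crossings_before, crossings_after)
print(f"p-value: {p_value:.4f}")
```

A small p-value would suggest that the increase in crossings is unlikely to be due to chance alone; it says nothing about whether the increase is large enough to justify the reimbursement cost, which is the caveat raised in C).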

##################################################################################################

Part 3 ‑ Predictive modeling
Ultimate is interested in predicting rider retention. To help explore this question, we have provided a sample dataset of a cohort of users who signed up for an Ultimate account in January 2014. The data was pulled several months later; we consider a user retained if they were “active” (i.e. took a trip) in the preceding 30 days.
We would like you to use this data set to help understand what factors are the best predictors for retention, and offer suggestions to operationalize those insights to help Ultimate.
The data is in the attached file ultimate_data_challenge.json. See below for a detailed description of the dataset. Please include any code you wrote for the analysis and delete the dataset when you have finished with the challenge.
1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?
2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.
3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long-term rider retention (again, a few sentences will suffice).

In [None]:
# Question 1 - Data Cleaning:
## The JSON file was already loaded in the first few cells.

# Creating a DataFrame from the JSON file loaded in as a list of dictionaries, data_dict:
## Starting from an empty DataFrame; the columns (including 'active') come from the rows concatenated below.
rider_df = pd.DataFrame()

date_difference = timedelta(days= 30)

current_date = datetime.strptime('2014-07-01', '%Y-%m-%d') # This is the latest date in the JSON file

for data in data_dict:
    temp_df = pd.DataFrame(data, index=[0])
    
    active_status = np.nan
    last_date = datetime.strptime(data['last_trip_date'], '%Y-%m-%d')
    
    if last_date + date_difference >= current_date:
        active_status = 'Active'
        
    else:
        active_status = 'Inactive'
    
    temp_df['active'] = active_status

    rider_df = pd.concat([rider_df, temp_df], ignore_index= True)

In [None]:
rider_df

In [None]:
# Question 1 - Exploratory Data Analysis:

## Want to count the cities of users:
plt.subplot(1, 2, 1)
plt.hist(rider_df.city[rider_df.active == 'Active'])
plt.ylim(0, 16000)
plt.xlabel('City')
plt.ylabel('Number of Active Users')
plt.subplot(1, 2, 2)
plt.hist(rider_df.city[rider_df.active == 'Inactive'], color= 'red')
plt.ylim(0, 16000)
plt.xlabel('City')
plt.ylabel('Number of Inactive Users')
plt.tight_layout()
plt.show()

In [None]:
## Want to see the Operating System/Device:
plt.subplot(1, 2, 1)
plt.hist(rider_df.phone[rider_df.active == 'Active'].fillna('NA'))
plt.ylim(0, 20000)
plt.xlabel('Operating System')
plt.ylabel('Number of Active Users')
plt.subplot(1, 2, 2)
plt.hist(rider_df.phone[rider_df.active == 'Inactive'].fillna('NA'), color= 'red')
plt.ylim(0, 20000)
plt.xlabel('Operating System')
plt.ylabel('Number of Inactive Users')
plt.tight_layout()
plt.show()

In [None]:
## Want to see the User's Rating (by driver) of "active users":
plt.subplot(1, 2, 1)
plt.hist(rider_df.avg_rating_by_driver[rider_df.active == 'Active'])
plt.ylim(0, 30000)
plt.xlabel('Average Rating (by Drivers)')
plt.ylabel('Number of Active Users')
plt.subplot(1, 2, 2)
plt.hist(rider_df.avg_rating_by_driver[rider_df.active == 'Inactive'], color = 'red')
plt.ylim(0, 30000)
plt.xlabel('Average Rating (by Drivers)')
plt.ylabel('Number of Inactive Users')
plt.tight_layout()
plt.show()

In [None]:
## Want to see the average distance of trips:
plt.subplot(1, 2, 1)
plt.hist(rider_df.avg_dist[rider_df.active == 'Active'], bins=60)
plt.ylim(0, 10000)
plt.xlim(0, 60)
plt.xlabel('Average Distance of Trips')
plt.ylabel('Number of Active Users')
plt.subplot(1, 2, 2)
plt.hist(rider_df.avg_dist[rider_df.active == 'Inactive'], color = 'red', bins=60)
plt.ylim(0, 10000)
plt.xlim(0, 60)
plt.xlabel('Average Distance of Trips')
plt.ylabel('Number of Inactive Users')
plt.tight_layout()
plt.show()

In [None]:
## Want to see the percentage of the total trips made during a weekday:
plt.subplot(1, 2, 1)
plt.hist(rider_df.weekday_pct[rider_df.active == 'Active'], bins=60)
plt.ylim(0, 8000)
plt.xlim(0, 100)
plt.xlabel('Percentage of Trips Made on a Weekday')
plt.ylabel('Number of Active Users')
plt.subplot(1, 2, 2)
plt.hist(rider_df.weekday_pct[rider_df.active == 'Inactive'], color = 'red', bins=60)
plt.ylim(0, 8000)
plt.xlim(0, 100)
plt.xlabel('Percentage of Trips Made on a Weekday')
plt.ylabel('Number of Inactive Users')
plt.tight_layout()
plt.show()

In [None]:
retention_percentage = len(rider_df[rider_df.active == 'Active']) / len(rider_df)
print(retention_percentage)

Question 1 - What fraction of the observed users were retained?:<br><br>
Of all users who signed up in January 2014, 37.608% are considered "active" (retained).
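
The same retention rule (last trip within the 30 days before 2014-07-01) can also be written without a row-by-row loop. This is a minimal sketch on a made-up five-row table; `demo_riders` and its dates are hypothetical, and only the `last_trip_date` column name is taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the rider data (only the column needed here)
demo_riders = pd.DataFrame({"last_trip_date": ["2014-06-17", "2014-05-05",
                                               "2014-01-07", "2014-06-29",
                                               "2014-03-15"]})

last_trip = pd.to_datetime(demo_riders["last_trip_date"])
cutoff = pd.Timestamp("2014-07-01") - pd.Timedelta(days=30)

# A user is retained if the last trip falls within the 30 days before 2014-07-01
demo_riders["retained"] = last_trip >= cutoff

# The mean of a boolean column is the fraction of True values
retention_fraction = demo_riders["retained"].mean()
print(retention_fraction)
```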

In [None]:
# Question 2 - Predictive Modeling:
## The model used is Random Forest because it is flexible enough to deal with the number of variables.
## A separate cross-validation step was skipped because each tree in a random forest is already trained on a
## bootstrap sample of the rows, with a random subset of features considered at each split.
### A neural network model is included in a later cell.
## Columns with dates will be omitted.

# Changing string values in rider_df to numerical values:
rider_df_numeric = pd.DataFrame(columns=list(['city', 'trips_in_first_30_days', 'avg_rating_of_driver','avg_surge', 
                                              'phone', 'surge_pct','ultimate_black_user', 'weekday_pct', 'avg_dist',
                                              'avg_rating_by_driver']))
label_list = []

for index, user in rider_df.iterrows():
    if user['city'] == "King's Landing":
        rider_df_numeric.loc[index, 'city'] = 0
        
    elif user['city'] == "Astapor":
        rider_df_numeric.loc[index, 'city'] = 1
        
    else:
        rider_df_numeric.loc[index, 'city'] = 3
        
    rider_df_numeric.loc[index, 'trips_in_first_30_days'] = user['trips_in_first_30_days']

    if np.isnan(user['avg_rating_of_driver']): ## Replacing np.NaN with average
        rider_df_numeric.loc[index, 'avg_rating_of_driver'] = np.nanmean(rider_df.avg_rating_of_driver)
    
    else:
        rider_df_numeric.loc[index, 'avg_rating_of_driver'] = user['avg_rating_of_driver']
        
    rider_df_numeric.loc[index, 'avg_surge'] = user['avg_surge']
    
    if user['phone'] == 'iPhone':
        rider_df_numeric.loc[index, 'phone'] = 0
        
    elif user['phone'] == 'Android':
        rider_df_numeric.loc[index, 'phone'] = 1
    
    else:
        rider_df_numeric.loc[index, 'phone'] = 0  ## All np.NaN are replaced with iPhone
    
    rider_df_numeric.loc[index, 'surge_pct'] = user['surge_pct']
    rider_df_numeric.loc[index, 'ultimate_black_user'] = int(user['ultimate_black_user'])
    rider_df_numeric.loc[index, 'weekday_pct'] = user['weekday_pct']
    rider_df_numeric.loc[index, 'avg_dist'] = user['avg_dist']
    
    if np.isnan(user['avg_rating_by_driver']): ## Replacing np.NaN with average
        rider_df_numeric.loc[index, 'avg_rating_by_driver'] = np.nanmean(rider_df.avg_rating_by_driver)
        
    else:
        rider_df_numeric.loc[index, 'avg_rating_by_driver'] = user['avg_rating_by_driver']

    if user['active'] == 'Inactive':
        label_list.append(0)
        
    else:
        label_list.append(1)
                
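rider_df_numeric = rider_df_numeric.astype(dtype=np.float64) ## Columns filled row by row start as "object" dtype, so cast to float
## Using label_list (0 = Inactive, 1 = Active); pd.factorize() numbers classes by order of appearance, so its 0/1 meaning is not fixed:
rider_df_label = np.array(label_list)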

# Splitting the data into test and train datasets:
X_train, X_test, Y_train, Y_test = train_test_split(rider_df_numeric, rider_df_label, test_size= 0.25, shuffle= True)

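# Optimizing hyperparameter for best number of trees:
## Note: the number of trees is selected on the test set, so the test accuracy reported below is optimistic.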
trees_list = list(range(64, 129))
accuracy_score_list = []

for trees in trees_list:
    random_forest = RandomForestClassifier(n_estimators= trees, max_features= 'sqrt', n_jobs= -1)
    random_forest.fit(X_train, Y_train)
    accuracy_score_list.append(random_forest.score(X_test, Y_test))
    
## Selecting the best number of trees
best_accuracy_score = max(accuracy_score_list)

for element, score in enumerate(accuracy_score_list):
    if score == best_accuracy_score:
        best_trees = trees_list[element]
        
random_forest = RandomForestClassifier(n_estimators= best_trees, max_features= 'sqrt', n_jobs= -1)
random_forest.fit(X_train, Y_train)
predictions = random_forest.predict(X_test)
accuracy_score = random_forest.score(X_test, Y_test)

## Extracting the feature importances for each input variable:
feature_coefficients = random_forest.feature_importances_

temp_dict = {'city': feature_coefficients[0], 
             'trips_in_first_30_days': feature_coefficients[1], 
             'avg_rating_of_driver': feature_coefficients[2],
             'avg_surge': feature_coefficients[3], 
             'phone': feature_coefficients[4], 
             'surge_pct': feature_coefficients[5],
             'ultimate_black_user': feature_coefficients[6], 
             'weekday_pct': feature_coefficients[7], 
             'avg_dist': feature_coefficients[8],
             'avg_rating_by_driver': feature_coefficients[9]}

feature_coefficients_df = pd.DataFrame(temp_dict, index=[0])

In [None]:
# Neural Network Model:
nn_model = keras.Sequential()
nn_model.add(keras.layers.Dense(20, activation='relu', input_shape=(10,)))
nn_model.add(keras.layers.Dense(10, activation='relu'))
nn_model.add(keras.layers.Dense(1, activation='sigmoid'))
nn_model.compile('adam', loss= 'binary_crossentropy', metrics=['accuracy']) # Accuracy can be used since this is a classification problem
nn_model.fit(X_train, Y_train, epochs=50, validation_split= 0.20)
nn_model_prediction = nn_model.predict(X_test)
nn_model.evaluate(X_test, Y_test)

In [None]:
print(accuracy_score)
print(best_trees)
feature_coefficients_df

In [None]:
# Obtaining Random Forest Metrics:
print(classification_report(Y_test, predictions))

In [None]:
# Obtaining Neural Network Metrics:
nn_predictions_convert = []

for predict in nn_model_prediction:
    nn_predictions_convert.append(round(predict[0]))
    
print(classification_report(Y_test, nn_predictions_convert))

Question 2 - Model Analysis:<br><br>
    The basic Random Forest model performed sufficiently, with an accuracy of ~75%. Other model types could have been used, such as kNN, but since the goal of the question is to find which features of the user could increase usage, the simple extraction of feature importances is sufficient for this purpose. A simple neural network was also created, and it had an accuracy of ~76%, which is similar to the performance of the random forest model. However, since neural networks do not provide a built-in way of determining feature importance, it was only used for predictions.<br><br>
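
One model-agnostic option for comparing feature importance across model types is permutation importance, which measures how much the test score drops when a single feature is shuffled. The sketch below uses synthetic data and scikit-learn's `MLPClassifier` as a stand-in for the Keras network; both are assumptions for illustration, and a Keras model would need a scikit-learn-compatible wrapper before it could be passed to `permutation_importance`.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for the rider features (10 columns, 3 informative)
X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Small neural network used here in place of the Keras model
mlp = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)

# Shuffle one feature at a time and record the mean drop in test accuracy
result = permutation_importance(mlp, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
for i in ranking[:3]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```

Because the same procedure works for the random forest, it would allow the two models' rankings to be compared on equal terms.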
    The three features with the highest impact according to the random forest model are the average distance of the trips, the percentage of trips made during weekdays, and the average user rating given by the driver. However, when comparing with the histograms, the difference between active and inactive users was not apparent.<br><br>
    Because the question is a binary classification problem, accuracy was used as the main statistic for determining the performance of the model. Looking at the other statistics in the classification reports for both models, the models were better at predicting active users (1) than inactive users (0).
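
As a complement to a single train/test accuracy, the number of trees can be chosen by cross-validation on the training set only, leaving the test set for one final evaluation. The sketch below uses synthetic, imbalanced data and a small, arbitrary list of candidate tree counts; all names and values are illustrative assumptions rather than the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in data (roughly 62% class 0, 38% class 1)
X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           weights=[0.62, 0.38], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Choose the number of trees with 5-fold cross-validation on the training set only
candidates = [25, 50, 100]
cv_means = {}
for n_trees in candidates:
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    cv_means[n_trees] = cross_val_score(forest, X_tr, y_tr, cv=5).mean()
best_n = max(cv_means, key=cv_means.get)

# Refit on the full training set and evaluate once on the held-out test set
final_forest = RandomForestClassifier(n_estimators=best_n, random_state=0)
final_forest.fit(X_tr, y_tr)
test_accuracy = final_forest.score(X_te, y_te)
cm = confusion_matrix(y_te, final_forest.predict(X_te))
print(best_n, round(test_accuracy, 3))
print(cm)  # rows: true class, columns: predicted class
```

The confusion matrix shows per-class errors directly, which helps when one class (here, inactive users) is the majority.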

Question 3:<br><br>
    Ultimate Technologies can attempt to market its product according to the features weighted most important by the random forest model. For example, if many of the trips made on weekdays are from people being driven to and from work, then marketing towards commuters would bring in more of the same type of customer. Another example could be to create a deal or program for customers who only need to travel a short distance, as most customers travel less than 20 miles in one trip. Highly rated users who have made more than a certain number of trips could be sent deals as an incentive to keep choosing Ultimate. 