Build a regression model.

Before building the model, I want to better classify the points of interest (POI).

In [49]:
%store -r bike_station_information_df
%store -r yelp_result_df
%store -r fourSquare_result_df
%store -r final_df

import pandas as pd
import numpy as np

In [50]:

# Data Audit
# 1. Check for Missing Values
missing_values = final_df.isnull().sum()
print("Missing Values:")
print(missing_values)

# 2. Check for Duplicates
duplicates = final_df.duplicated()
print("\nDuplicate Rows:")
print(final_df[duplicates])

# 3. Explore Data Types
data_types = final_df.dtypes
print("\nData Types:")
print(data_types)

# 4. Explore Unique Values
unique_values = final_df.nunique()
print("\nNumber of Unique Values:")
print(unique_values)

# 5. Summarize numerical columns (a first look for outliers in min/max and quartiles)
summary_stats = final_df.describe()
print("\nDescriptive Statistics:")
print(summary_stats)

Missing Values:
City                 0
Latitude             0
Longitude            0
Number of Bikes      0
Key                  0
Name                 0
Category             0
Distance             0
Rating             440
dtype: int64

Duplicate Rows:
Empty DataFrame
Columns: [City, Latitude, Longitude, Number of Bikes, Key, Name, Category, Distance, Rating]
Index: []

Data Types:
City                object
Latitude           float64
Longitude          float64
Number of Bikes      int64
Key                  int64
Name                object
Category            object
Distance           float64
Rating             float64
dtype: object

Number of Unique Values:
City                 1
Latitude            44
Longitude           44
Number of Bikes     13
Key                 44
Name               360
Category           163
Distance           738
Rating               9
dtype: int64

Descriptive Statistics:
         Latitude   Longitude  Number of Bikes         Key      Distance  \
count  849.
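
`describe()` summarizes each numerical column but does not flag individual rows. A minimal sketch of one common rule of thumb, the 1.5 x IQR fence, is shown below. The helper name `iqr_outliers` and the sample distances are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside the k*IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical distances in metres; 5000.0 is deliberately extreme.
sample = pd.Series([250.0, 280.0, 360.0, 400.0, 450.0, 5000.0], name="Distance")
mask = iqr_outliers(sample)

# Show only the flagged values
print(sample[mask])
```

The same mask could be applied to a real column such as `final_df['Distance']`, though whether a flagged value is truly an error is a judgment call.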

In [51]:
# Handling Missing Values

# Count occurrences of NaN in 'Rating'
missing_values_count = final_df['Rating'].isna().sum()
print(f"Occurrences of 'NaN' in Rating: {missing_values_count}")

# Count occurrences of 'N/A' in 'Category' 
category_na_count = (final_df['Category'] == 'N/A').sum()
print(f"Occurrences of 'N/A' in Category: {category_na_count}")

# Fill NaN values in the 'Rating' column with the column mean
# (plain assignment instead of inplace=True on a column selection)
final_df['Rating'] = final_df['Rating'].fillna(final_df['Rating'].mean())
final_df['Rating'] = final_df['Rating'].round(2)


# Drop rows with 'N/A' in 'Category'
final_df = final_df[(final_df['Category'] != 'N/A')]


# Count occurrences of NaN in 'Rating' after cleaning and dropping rows
missing_values_count = final_df['Rating'].isna().sum()
print(f"Occurrences of 'NaN' in Rating after Clean: {missing_values_count}")

# Count occurrences of 'N/A' in 'Category' after cleaning and dropping rows
category_na_count = (final_df['Category'] == 'N/A').sum()
print(f"Occurrences of 'N/A' in Category after Clean: {category_na_count}")



Occurrences of 'NaN' in Rating: 440
Occurrences of 'N/A' in Category: 25
Occurrences of 'NaN' in Rating after Clean: 0
Occurrences of 'N/A' in Category after Clean: 0


In [52]:
def categorize(category):
  # Restaurants
  if any(keyword in category for keyword in ['Bakery', 'Restaurant', 'French', 'Japanese', 'Pizza', 'Indian', 'Pan Asian', 'Moroccan', 'Thai', 'Bistros', 'Pizzeria', 'Burger Joint', 'Steakhouses', 'Italian', 'Seafood', 'Mexican', 'Chinese', 'Fast Food']):
    return 'Restaurant'
  
  # Bars and Lounges
  elif any(keyword in category for keyword in ['Bar', 'Pub', 'Lounge']):
    return 'Bar/Lounge'
  
  # Health Centers
  elif any(keyword in category for keyword in ['Hospital', 'Health', 'Drugstore']):
    return 'Health Center'
  
  # Retails
  elif any(keyword in category for keyword in ['Computers and Electronics Retail', 'Store', 'Shop', 'Retail', 'Bookstore', 'Botanical Gardens']):
    return 'Retail'
  
  # Education
  elif any(keyword in category for keyword in ['School', 'Education', 'Library']):
    return 'Education'
  
  # Entertainment
  elif any(keyword in category for keyword in ['Music Venue', 'Theater', 'Cinema', 'Arts and Entertainment', 'Paintball Field']):
    return 'Entertainment'
  
  # Home Services
  elif any(keyword in category for keyword in ['Carpenter', 'Home Improvement Service', 'Electrician', 'Plumber', 'Interior Designer', 'Roof Deck', 'Heating, Ventilating and Air Conditioning Contractor']):
    return 'Home Services'
  
  # Landmarks and Historical Buildings
  elif any(keyword in category for keyword in ['Monument', 'Scenic Lookout', 'Park', 'Landmarks & Historical Buildings', 'Lakes']):
    return 'Landmarks & Historical Buildings'
  
  # Business and Professional Services
  elif any(keyword in category for keyword in ['Management Consultant', 'Business and Professional Services', 'Employment Agency', 'Organization', 'Professional Cleaning Service', 'Accounting and Bookkeeping Service']):
    return 'Business Services'
  
  # Other Services
  elif any(keyword in category for keyword in ['Computer Repair Service', 'Audiovisual Service', 'Post Office', 'Utility Company']):
    return 'Other Services'
  
  elif 'N/A' in category:
    return 'N/A'
  
  # Miscellaneous
  else:
    return 'Miscellaneous'


final_df['GroupedCategory'] = final_df['Category'].apply(categorize)

# Check all unique values in the 'GroupedCategory' column
unique_new_categories = final_df['GroupedCategory'].unique()
print(unique_new_categories)


final_df[final_df['Key'] == 0].head(10)

['Education' 'Retail' 'Restaurant' 'Health Center' 'Bar/Lounge'
 'Other Services' 'Miscellaneous' 'Landmarks & Historical Buildings'
 'Business Services' 'Home Services' 'Entertainment']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_df['GroupedCategory'] = final_df['Category'].apply(categorize)
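
The warning above appears because `final_df` was produced by filtering another DataFrame, so pandas cannot tell whether the new column is meant for the filtered result or the original. One common way to avoid it is to take an explicit copy right after filtering. The sketch below uses a made-up three-row frame; the names `df`, `filtered`, and `Length` are illustrative only.

```python
import pandas as pd

# Toy stand-in for final_df; the values are made up for illustration.
df = pd.DataFrame({"Category": ["Bakery", "N/A", "Pub"]})

# An explicit copy after filtering gives an independent DataFrame,
# so adding a column later does not raise SettingWithCopyWarning.
filtered = df[df["Category"] != "N/A"].copy()
filtered["Length"] = filtered["Category"].str.len()

print(filtered)
```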


Unnamed: 0,City,Latitude,Longitude,Number of Bikes,Key,Name,Category,Distance,Rating,GroupedCategory
0,Cergy,49.034625,2.067596,5,0,Ecole de biologie industrielle,Education,259.0,3.71,Education
1,Cergy,49.034625,2.067596,5,0,Esprit Clean,Fashion Accessories Store,283.0,3.71,Retail
2,Cergy,49.034625,2.067596,5,0,Le Millésime du Port,French Restaurant,634.0,3.71,Restaurant
3,Cergy,49.034625,2.067596,5,0,La Taverne des Rois,Restaurant,648.0,3.71,Restaurant
4,Cergy,49.034625,2.067596,5,0,Au Fourmont Village,Bakery,359.0,3.71,Restaurant
7,Cergy,49.034625,2.067596,5,0,École Maternelle Publique le Village,Elementary School,397.0,3.71,Education
8,Cergy,49.034625,2.067596,5,0,France Depannage Auto,Miscellaneous Store,425.0,3.71,Retail
9,Cergy,49.034625,2.067596,5,0,Pharmacie des Trois Gares,Drugstore,448.0,3.71,Health Center
440,Cergy,49.034625,2.067596,5,0,La Rotisserie Ô,French,500.015071,4.5,Restaurant
441,Cergy,49.034625,2.067596,5,0,La Cavallina,Italian,496.288387,5.0,Restaurant
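
One caveat with `categorize`: the `keyword in category` test is a case-sensitive substring match, so a keyword such as 'Bar' or 'Pub' would also match inside longer words. The sketch below contrasts that with a whole-word match using a regular expression. The helper `has_keyword` and the example categories ('Barbershop', 'Public Library') are hypothetical and are not claimed to occur in the data.

```python
import re

def has_keyword(category, keywords):
    """True if any keyword appears in category as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(k)}\b", category) for k in keywords)

bar_keywords = ["Bar", "Pub", "Lounge"]

for example in ["Barbershop", "Public Library", "Cocktail Bar"]:
    substring_hit = any(k in example for k in bar_keywords)  # test used in categorize()
    whole_word_hit = has_keyword(example, bar_keywords)      # stricter alternative
    print(example, substring_hit, whole_word_hit)
```

With the substring test, all three examples would be grouped as 'Bar/Lounge'; with the whole-word test, only 'Cocktail Bar' would.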


In [53]:
import pandas as pd

# Creating a new dataframe using information from the combined dataframe (final_df)
grouped_df = pd.pivot_table(final_df, 
                            index=['Key', 'Number of Bikes', 'Latitude', 'Longitude'],
                            columns='GroupedCategory', 
                            values='Category', 
                            aggfunc='count', 
                            fill_value=0).reset_index()
grouped_df.columns.name = None

grouped_df.head(10)

Unnamed: 0,Key,Number of Bikes,Latitude,Longitude,Bar/Lounge,Business Services,Education,Entertainment,Health Center,Home Services,Landmarks & Historical Buildings,Miscellaneous,Other Services,Restaurant,Retail
0,0,5,49.034625,2.067596,1,0,2,0,1,0,0,0,0,12,2
1,1,3,49.03632,2.079568,6,0,0,0,0,0,0,2,1,6,5
2,2,5,49.017697,2.099417,0,0,3,0,1,0,0,5,1,6,3
3,3,6,49.044092,2.036546,0,1,1,0,0,1,4,7,2,4,0
4,4,9,49.030665,2.083067,3,2,3,0,0,0,0,6,0,6,0
5,5,7,49.035511,2.061921,0,0,1,0,1,0,0,1,0,13,3
6,6,5,49.052785,2.017956,0,0,1,0,1,1,0,6,0,8,3
7,7,6,49.038058,2.070534,4,1,0,1,0,0,0,1,0,11,1
8,8,9,49.044395,2.090632,2,0,2,2,1,0,0,8,0,5,0
9,9,15,49.038874,2.076405,5,1,0,0,0,0,0,2,0,6,6
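
To make the reshaping step concrete, here is a minimal sketch of the same `pd.pivot_table` call on a made-up four-row frame. Each input row is one POI near a station, and each output row is one station with a count per grouped category. The frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy long-format data: one row per (station, POI); values are made up.
toy = pd.DataFrame({
    "Key": [0, 0, 0, 1],
    "Category": ["Bakery", "Pub", "Pizza", "Bakery"],
    "GroupedCategory": ["Restaurant", "Bar/Lounge", "Restaurant", "Restaurant"],
})

# Count POIs per station and grouped category; missing combinations become 0.
wide = pd.pivot_table(
    toy,
    index="Key",
    columns="GroupedCategory",
    values="Category",
    aggfunc="count",
    fill_value=0,
).reset_index()
wide.columns.name = None

print(wide)
```

Station 1 has no bar in the toy data, so `fill_value=0` is what turns the missing combination into a zero rather than NaN.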


Provide model output and an interpretation of the results. 

In [54]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

X = grouped_df[['Latitude', 'Longitude', 'Bar/Lounge', 'Business Services', 'Education', 'Entertainment',
        'Health Center', 'Home Services', 'Landmarks & Historical Buildings', 'Miscellaneous',
        'Other Services', 'Restaurant', 'Retail']]
y = grouped_df['Number of Bikes']

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        Number of Bikes   R-squared:                       0.345
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     1.217
Date:                Thu, 21 Dec 2023   Prob (F-statistic):              0.316
Time:                        00:31:28   Log-Likelihood:                -99.270
No. Observations:                  44   AIC:                             226.5
Df Residuals:                      30   BIC:                             251.5
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                                       coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
const   

The R-squared value of 0.345 indicates that the model explains 34.5% of the variability in the Number of Bikes. However, the much lower adjusted R-squared (0.061) suggests that many of the included variables add little explanatory power, and the F-statistic p-value of 0.316 means the model as a whole is not statistically significant at the 0.05 level.
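
The gap between R-squared and adjusted R-squared follows from the standard adjustment for the number of predictors, 1 - (1 - R²)(n - 1)/(n - p - 1). As a quick check, the arithmetic below plugs in the values printed in the OLS summary above (R-squared, No. Observations, Df Model); the variable names are just for illustration.

```python
# Values read from the OLS summary printed above.
r_squared = 0.345
n_obs = 44         # No. Observations
n_predictors = 13  # Df Model

# Adjusted R-squared penalizes each additional predictor.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adj_r_squared, 3))
```

With 13 predictors and only 44 stations, the penalty removes most of the apparent fit.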

Among the independent variables, only Longitude, Landmarks & Historical Buildings, and Restaurant have p-values below 0.05, which suggests that these variables may have a statistically significant relationship with the Number of Bikes. Variables with p-values above 0.05 may not be contributing meaningfully and could be considered for removal from the model.

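Latitude and Longitude have the largest coefficients, but this does not by itself mean they have the greatest impact on the number of bikes. Coefficient size depends on each variable's units: the coordinates vary by only a few hundredths of a degree across stations, while the POI columns are counts, so the raw coefficients are not directly comparable.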
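
When comparing coefficient sizes, keep in mind that an OLS coefficient scales inversely with the units of its predictor. The sketch below illustrates this on synthetic data with NumPy only; the data, the helper `slope`, and the factor of 100 are assumptions chosen for illustration, not values from the bike data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

def slope(pred, target):
    """OLS slope of target on pred (with an intercept), via least squares."""
    design = np.column_stack([np.ones_like(pred), pred])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1]

slope_original = slope(x, y)
# Same predictor expressed in units 100 times larger (values 100 times smaller).
slope_rescaled = slope(x / 100.0, y)

print(slope_original, slope_rescaled)
```

The rescaled slope is 100 times the original even though the relationship is unchanged, which is why standardizing predictors (or comparing t-statistics) is a fairer basis for judging relative importance.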

# Stretch

How can you turn the regression model into a classification model?
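
One possible approach, sketched here under stated assumptions, is to convert the continuous target into class labels and then fit a classifier on the same predictors. The example uses the first ten `Number of Bikes` values shown in the `grouped_df` output above and splits them at the median. The median threshold and the label name `many_bikes` are arbitrary choices for illustration.

```python
import pandas as pd

# First ten 'Number of Bikes' values from the grouped_df output above.
bikes = pd.Series([5, 3, 5, 6, 9, 7, 5, 6, 9, 15], name="Number of Bikes")

# Assumed rule: a station is "high" if it has more bikes than the median.
threshold = bikes.median()
many_bikes = (bikes > threshold).astype(int)

print(threshold)
print(many_bikes.tolist())
```

A binary column like this could then serve as the target of a classifier such as logistic regression (for example `sm.Logit` in statsmodels) in place of `sm.OLS`, with accuracy or a confusion matrix used to evaluate it instead of R-squared.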