# Brainstorming

In [1]:
import pandas as pd

<hr style="border:2px solid black"> </hr>

# Chicago's 'L' Train

## Target Variable: "rides"

### Hypothesis: 
- There is no relationship between number of riders and station

### Project Goals: 
- to create a machine learning model that will accurately predict the number of riders per station

### Important Questions:
- do less affluent areas use mass transit more frequently (socioeconomics)?
- does date play a role in the number of riders?
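One way to start on the date question is to parse the `date` strings and derive calendar features. The sketch below rebuilds the five rows shown by `df.head()` so it runs on its own; in the notebook the same steps would be applied to `df`. The names `sample`, `year`, `month`, `day_of_week`, and `rides_by_daytype` are illustrative, and the month/day/year format is assumed from the printed rows.

```python
import pandas as pd

# Five rows copied from the df.head() output; not the full dataset.
sample = pd.DataFrame({
    "station_id": [41280, 41000, 40280, 40140, 40690],
    "stationname": ["Jefferson Park", "Cermak-Chinatown", "Central-Lake",
                    "Dempster-Skokie", "Dempster"],
    "date": ["12/22/2017", "12/18/2017", "12/02/2017", "12/19/2017", "12/03/2017"],
    "daytype": ["W", "W", "A", "W", "U"],
    "rides": [6104, 3636, 1270, 1759, 499],
})

# Parse the strings into datetimes (format assumed to be month/day/year).
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

# Derive calendar features that a model could use as predictors.
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["day_of_week"] = sample["date"].dt.day_name()

# Average rides per day type, as a first look at whether date matters.
rides_by_daytype = sample.groupby("daytype")["rides"].mean()
print(rides_by_daytype)
```

In these five rows the `W` dates fall on weekdays, `A` on a Saturday, and `U` on a Sunday, which suggests `daytype` already encodes part of the calendar signal.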

### Additional Info:
- large dataset (1.05M entries)
- over 20 years' worth of data
- fairly clean data

- **Resume Builder**: employers are likely to be interested in predicting the number of customers

In [2]:
df = pd.read_csv('CTA_totals.csv')

In [3]:
df.head()

Unnamed: 0,station_id,stationname,date,daytype,rides
0,41280,Jefferson Park,12/22/2017,W,6104
1,41000,Cermak-Chinatown,12/18/2017,W,3636
2,40280,Central-Lake,12/02/2017,A,1270
3,40140,Dempster-Skokie,12/19/2017,W,1759
4,40690,Dempster,12/03/2017,U,499


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1049633 entries, 0 to 1049632
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   station_id   1049633 non-null  int64 
 1   stationname  1049633 non-null  object
 2   date         1049633 non-null  object
 3   daytype      1049633 non-null  object
 4   rides        1049633 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 40.0+ MB


In [5]:
df.shape

(1049633, 5)

In [6]:
df.isnull().sum()

station_id     0
stationname    0
date           0
daytype        0
rides          0
dtype: int64

In [21]:
df.rides.max(), df.rides.min()

(36323, 0)

<hr style="border:2px solid black"> </hr>

# Food Inspection Scores

## Target Variable: "Score"

### Hypothesis: 
- There is no relationship between food inspection scores and location

### Project Goals: 
- to create a machine learning model that will accurately predict the food inspection score of a restaurant

### Important Questions:
- do food inspection scores depend on location of establishment?
- do food inspection scores depend on type of inspection?
- do food inspection scores depend on date?

### Additional Info:
- will need to clean data somewhat (change dtypes, drop columns)
    - `Inspection Date` to datetime
    - `Zip Code` to int (it is float64 and has 5 missing values)
    
- **Resume Builder**: any business can use predictive modeling to improve customer satisfaction, thus increasing revenue
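The dtype changes listed above could look roughly like this. It is a minimal sketch on the five rows shown by `df3.head()`, not the full cleaning step. The name `sample3` is illustrative. Using pandas' nullable `Int64` dtype is an assumption, chosen because `Zip Code` has a few missing values and a plain `int` cast would fail on them. Dropping those rows first would be another option.

```python
import pandas as pd

# Rows copied from the df3.head() output (a subset of the columns).
sample3 = pd.DataFrame({
    "Restaurant Name": ["Comfort Suites", "Crazy Fruits # 3", "Mi Casita",
                        "SV-Nala's", "7-Eleven 36559H"],
    "Zip Code": [78744.0, 78719.0, 78725.0, 78735.0, 78734.0],
    "Inspection Date": ["04/20/2021", "06/03/2021", "05/19/2021",
                        "06/01/2021", "03/08/2019"],
    "Score": [99, 80, 90, 73, 89],
})

# Inspection Date: string -> datetime (format assumed to be month/day/year).
sample3["Inspection Date"] = pd.to_datetime(
    sample3["Inspection Date"], format="%m/%d/%Y"
)

# Zip Code: float -> nullable integer, which keeps missing values as <NA>.
sample3["Zip Code"] = sample3["Zip Code"].astype("Int64")

print(sample3.dtypes)
```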


In [7]:
df3 = pd.read_csv('Food_Scores.csv')

In [8]:
df3.head()

Unnamed: 0,Restaurant Name,Zip Code,Inspection Date,Score,Address,Facility ID,Process Description
0,Comfort Suites,78744.0,04/20/2021,99,"5001 S IH\nAUSTIN, TX 78744",11905043,Routine Inspection
1,Crazy Fruits # 3,78719.0,06/03/2021,80,"5611 S US 183 HWY\nAUSTIN, TX 78719",12154211,Routine Inspection
2,Mi Casita,78725.0,05/19/2021,90,"9809 FM 969 RD\nAUSTIN, TX 78725",11633987,Routine Inspection
3,SV-Nala's,78735.0,06/01/2021,73,"4894 W US 290 HWY\nSUNSET VALLEY, TX 78735",11994463,Routine Inspection
4,7-Eleven 36559H,78734.0,03/08/2019,89,"3636 N FM 620 RD\nAUSTIN, TX 78734",10874261,Routine Inspection


In [9]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25011 entries, 0 to 25010
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Restaurant Name      25011 non-null  object 
 1   Zip Code             25006 non-null  float64
 2   Inspection Date      25011 non-null  object 
 3   Score                25011 non-null  int64  
 4   Address              25011 non-null  object 
 5   Facility ID          25011 non-null  int64  
 6   Process Description  25011 non-null  object 
dtypes: float64(1), int64(2), object(4)
memory usage: 1.3+ MB


In [23]:
df3.shape

(25011, 7)

In [11]:
df3.isnull().sum()

Restaurant Name        0
Zip Code               5
Inspection Date        0
Score                  0
Address                0
Facility ID            0
Process Description    0
dtype: int64

In [26]:
df3.Score.max(), df3.Score.min()

(100, 44)

<hr style="border:2px solid black"> </hr>

# Predicting Alcohol Sales by County

In [12]:
df2 = pd.read_csv('mixed_bev.csv')

In [13]:
df2.head()

Unnamed: 0,Taxpayer Number,Taxpayer Name,Taxpayer Address,Taxpayer City,Taxpayer State,Taxpayer Zip,Taxpayer County,Location Number,Location Name,Location Address,...,Inside/Outside City Limits,TABC Permit Number,Responsibility Begin Date,Responsibility End Date,Obligation End Date,Liquor Receipts,Wine Receipts,Beer Receipts,Cover Charge Receipts,Total Receipts
0,32047970895,HONDURAS MAYA CAFE & BAR LLC,8011 HAZEN ST,HOUSTON,TX,77036.0,101,1,HONDURAS MAYA CAFE & BAR LLC,5945 BELLAIRE BLVD STE B,...,Y,MB817033,08/16/2012,09/12/2019,07/31/2019,0,0,0,0,0
1,32049923835,"MERMAID KARAOKE PRIVATE CLUB, INC.",2639 WALNUT HILL LN STE 225,DALLAS,TX,75229.0,57,1,MERMAID KARAOKE PRIVATE CLUB,1310 W CAMPBELL RD STE 103,...,Y,N 837378,04/12/2013,07/01/2015,08/31/2014,480,185,1374,0,2039
2,32034036304,FENG KAI CORPORATION,8427 BOULEVARD 26,N RICHLND HLS,TX,76180.0,220,1,JAPANESE GRILL,8427 BOULEVARD 26,...,Y,MB576670,05/01/2008,03/17/2018,06/30/2016,1143,167,669,0,1979
3,14537211071,"THE HUTTO SMITHS, LLC",429 LITTLE LAKE RD,HUTTO,TX,78634.0,246,1,THE DOWNTOWN HALL OF FAME,205 EAST ST,...,Y,MB791778,12/06/2011,,03/31/2018,12881,357,10447,0,23685
4,32019999229,"THE CROSSING AT FIDDLE CREEK, INC.",1620 W CEDAR ST,STEPHENVILLE,TX,76401.0,72,1,THE CROSSING AT FIDDLE CREEK INC,2004 W SWAN ST,...,Y,N 643163,10/26/2006,11/30/2013,08/31/2008,4841,2413,4620,0,11874


- `Taxpayer Zip` needs to be converted to int
- remove the name columns
- fill in the null `Taxpayer State` values
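A rough sketch of those three cleanup notes follows, run on a few columns from the `df2.head()` rows. Several choices here are assumptions rather than decisions recorded in the notebook: that "name" means both `Taxpayer Name` and `Location Name`, that the nullable `Int64` dtype is acceptable for the zip code, and that the placeholder `"UNKNOWN"` is a reasonable fill for a missing state. The names `sample2` and `cleaned2` are illustrative.

```python
import pandas as pd

# A few columns from the df2.head() output; not the full dataset.
sample2 = pd.DataFrame({
    "Taxpayer Name": ["HONDURAS MAYA CAFE & BAR LLC",
                      "MERMAID KARAOKE PRIVATE CLUB, INC.",
                      "FENG KAI CORPORATION",
                      "THE HUTTO SMITHS, LLC",
                      "THE CROSSING AT FIDDLE CREEK, INC."],
    "Taxpayer State": ["TX", "TX", "TX", "TX", "TX"],
    "Taxpayer Zip": [77036.0, 75229.0, 76180.0, 78634.0, 76401.0],
    "Location Name": ["HONDURAS MAYA CAFE & BAR LLC",
                      "MERMAID KARAOKE PRIVATE CLUB",
                      "JAPANESE GRILL",
                      "THE DOWNTOWN HALL OF FAME",
                      "THE CROSSING AT FIDDLE CREEK INC"],
    "Total Receipts": [0, 2039, 1979, 23685, 11874],
})

# Assumption: "remove name" refers to both free-text name columns.
cleaned2 = sample2.drop(columns=["Taxpayer Name", "Location Name"])

# Nullable Int64 keeps missing zips as <NA> instead of raising on the cast.
cleaned2["Taxpayer Zip"] = cleaned2["Taxpayer Zip"].astype("Int64")

# Assumption: a placeholder label; these sample rows have no missing states.
cleaned2["Taxpayer State"] = cleaned2["Taxpayer State"].fillna("UNKNOWN")

print(cleaned2.dtypes)
```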

In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2556136 entries, 0 to 2556135
Data columns (total 24 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   Taxpayer Number             int64  
 1   Taxpayer Name               object 
 2   Taxpayer Address            object 
 3   Taxpayer City               object 
 4   Taxpayer State              object 
 5   Taxpayer Zip                float64
 6   Taxpayer County             int64  
 7   Location Number             int64  
 8   Location Name               object 
 9   Location Address            object 
 10  Location City               object 
 11  Location State              object 
 12  Location Zip                int64  
 13  Location County             int64  
 14  Inside/Outside City Limits  object 
 15  TABC Permit Number          object 
 16  Responsibility Begin Date   object 
 17  Responsibility End Date     object 
 18  Obligation End Date         object 
 19  Liquor Receipts      

In [15]:
df2.shape

(2556136, 24)

In [16]:
df2.isnull().sum()

Taxpayer Number                     0
Taxpayer Name                       0
Taxpayer Address                    0
Taxpayer City                       0
Taxpayer State                    884
Taxpayer Zip                      884
Taxpayer County                     0
Location Number                     0
Location Name                       0
Location Address                    0
Location City                       0
Location State                      0
Location Zip                        0
Location County                     0
Inside/Outside City Limits          0
TABC Permit Number                  0
Responsibility Begin Date           0
Responsibility End Date       1511837
Obligation End Date                 0
Liquor Receipts                     0
Wine Receipts                       0
Beer Receipts                       0
Cover Charge Receipts               0
Total Receipts                      0
dtype: int64
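Raw null counts are easier to judge as a share of all rows. The sketch below recomputes that share from the counts printed above and the row count reported by `df2.info()`. In the notebook itself, `df2.isnull().mean()` should give the same fractions directly. The names `total_rows`, `null_counts`, and `null_share` are illustrative.

```python
import pandas as pd

# Row count from the df2.info() output.
total_rows = 2556136

# Non-zero null counts copied from the df2.isnull().sum() output.
null_counts = pd.Series({
    "Taxpayer State": 884,
    "Taxpayer Zip": 884,
    "Responsibility End Date": 1511837,
})

# Percentage of rows missing each field.
null_share = (null_counts / total_rows * 100).round(2)
print(null_share)
```

More than half of the rows have no `Responsibility End Date`, so that column probably needs a deliberate decision, such as dropping it or treating a null as "still active", rather than row-wise deletion. The "still active" reading is a guess and is not confirmed by the data shown here.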