# G2M Insight for Cab Investment Firm

In [25]:
import pandas as pd
from IPython.display import display  # explicit import so the cells also run outside a notebook

In [26]:
cab_data = pd.read_csv("datasets/Cab_Data.csv")
city_data = pd.read_csv("datasets/City.csv")
customer_data = pd.read_csv("datasets/Customer_ID.csv")
transaction_data = pd.read_csv("datasets/Transaction_ID.csv")

## Data Profiling
#### Cab Dataset

In [27]:
print("CAB DATA INFO")
cab_data.info()  # info() prints its report and returns None, so there is nothing to assign
print("\nCAB DATA HEAD")
cab_data_head = cab_data.head()
display(cab_data_head)
print("\nCAB DATA DESCRIBE")
cab_data_describe = cab_data.describe()
display(cab_data_describe)

CAB DATA INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 359392 entries, 0 to 359391
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Transaction ID  359392 non-null  int64  
 1   Date of Travel  359392 non-null  int64  
 2   Company         359392 non-null  object 
 3   City            359392 non-null  object 
 4   KM Travelled    359392 non-null  float64
 5   Price Charged   359392 non-null  float64
 6   Cost of Trip    359392 non-null  float64
dtypes: float64(3), int64(2), object(2)
memory usage: 19.2+ MB

CAB DATA HEAD


| | Transaction ID | Date of Travel | Company | City | KM Travelled | Price Charged | Cost of Trip |
|---|---|---|---|---|---|---|---|
| 0 | 10000011 | 42377 | Pink Cab | ATLANTA GA | 30.45 | 370.95 | 313.635 |
| 1 | 10000012 | 42375 | Pink Cab | ATLANTA GA | 28.62 | 358.52 | 334.854 |
| 2 | 10000013 | 42371 | Pink Cab | ATLANTA GA | 9.04 | 125.2 | 97.632 |
| 3 | 10000014 | 42376 | Pink Cab | ATLANTA GA | 33.17 | 377.4 | 351.602 |
| 4 | 10000015 | 42372 | Pink Cab | ATLANTA GA | 8.73 | 114.62 | 97.776 |



CAB DATA DESCRIBE


| | Transaction ID | Date of Travel | KM Travelled | Price Charged | Cost of Trip |
|---|---|---|---|---|---|
| count | 359392.0 | 359392.0 | 359392.0 | 359392.0 | 359392.0 |
| mean | 10220760.0 | 42964.067998 | 22.567254 | 423.443311 | 286.190113 |
| std | 126805.8 | 307.467197 | 12.233526 | 274.378911 | 157.993661 |
| min | 10000010.0 | 42371.0 | 1.9 | 15.6 | 19.0 |
| 25% | 10110810.0 | 42697.0 | 12.0 | 206.4375 | 151.2 |
| 50% | 10221040.0 | 42988.0 | 22.44 | 386.36 | 282.48 |
| 75% | 10330940.0 | 43232.0 | 32.96 | 583.66 | 413.6832 |
| max | 10440110.0 | 43465.0 | 48.0 | 2048.03 | 691.2 |


High-level Observations and Insights On Cab Dataset
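- The Date of Travel column is stored as Excel serial dates (integer day counts such as 42377)
    - These need converting to datetimes before any time-based analysis; `pd.to_datetime` can do this when given the Excel epoch as its origin
- Future analysis topics from this table
    - Price Charged minus Cost of Trip gives the profit per trip, which can be aggregated to compare the **profitability** of each company
    - Compare KM Travelled against Date of Travel and City to find trends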
- No null values found
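
The two ideas above (date conversion and per-trip profit) can be sketched on a few rows copied from the head output. This is a minimal sketch, not the notebook's actual cleaning step: the `Travel Date` and `Profit` column names are hypothetical, and the conversion assumes the serials use Excel's 1900 date system, for which day 0 corresponds to 1899-12-30.

```python
import pandas as pd

# Sample rows copied from the cab_data.head() output above.
cab_sample = pd.DataFrame({
    "Transaction ID": [10000011, 10000012, 10000013],
    "Date of Travel": [42377, 42375, 42371],
    "Company": ["Pink Cab", "Pink Cab", "Pink Cab"],
    "Price Charged": [370.95, 358.52, 125.20],
    "Cost of Trip": [313.635, 334.854, 97.632],
})

# Assumption: Excel 1900 date system, so serial N is N days after 1899-12-30.
cab_sample["Travel Date"] = pd.to_datetime(
    cab_sample["Date of Travel"], unit="D", origin="1899-12-30"
)

# Profit per trip ("Profit" is a hypothetical column name).
cab_sample["Profit"] = cab_sample["Price Charged"] - cab_sample["Cost of Trip"]

# Total and average profit per company.
profit_by_company = cab_sample.groupby("Company")["Profit"].agg(["sum", "mean"])

print(cab_sample[["Date of Travel", "Travel Date", "Profit"]])
print(profit_by_company)
```

On the full table the same two assignments would apply to `cab_data` directly, and the groupby would then cover every company present in the Company column.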

#### City Dataset

In [37]:
print("CITY DATA INFO")
city_data.info()  # info() prints its report and returns None, so there is nothing to assign
print("\nCITY DATA HEAD")
city_data_head = city_data.head()
display(city_data_head)
print("\nCITY DATA DESCRIBE")
city_data_describe = city_data.describe()
display(city_data_describe)

CITY DATA INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   City        20 non-null     object
 1   Population  20 non-null     object
 2   Users       20 non-null     object
dtypes: object(3)
memory usage: 608.0+ bytes

CITY DATA HEAD


| | City | Population | Users |
|---|---|---|---|
| 0 | NEW YORK NY | 8405837 | 302149 |
| 1 | CHICAGO IL | 1955130 | 164468 |
| 2 | LOS ANGELES CA | 1595037 | 144132 |
| 3 | MIAMI FL | 1339155 | 17675 |
| 4 | SILICON VALLEY | 1177609 | 27247 |



CITY DATA DESCRIBE


| | City | Population | Users |
|---|---|---|---|
| count | 20 | 20 | 20 |
| unique | 20 | 20 | 20 |
| top | NEW YORK NY | 8405837 | 302149 |
| freq | 1 | 1 | 1 |


High-level Observations and Insights On City Dataset
- The Population and Users columns are stored as `object` (string) dtype rather than integers
    - This will hinder numeric analysis and should be dealt with in the cleaning phase
- Combined with the number of cab users, this dataset can be used to find **market penetration**
    - This can also hint at the adoption rate of new cabs
- Joined with cab_data on City, this dataset can be used to see profits by city
- No null values found
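
One way the cleaning step for these columns might look, sketched on rows copied from the head output. The cause of the `object` dtype is not shown in the output above, so the helper assumes it comes from stray whitespace or thousands separators. `to_int` and the `Penetration` column are hypothetical names, and penetration is taken here to mean Users divided by Population.

```python
import pandas as pd

# Sample rows copied from the city_data.head() output, kept as strings
# to mimic the object dtype reported by info().
city_sample = pd.DataFrame({
    "City": ["NEW YORK NY", "CHICAGO IL", "LOS ANGELES CA"],
    "Population": ["8405837", "1955130", "1595037"],
    "Users": ["302149", "164468", "144132"],
})

def to_int(series: pd.Series) -> pd.Series:
    """Strip commas and surrounding whitespace, then parse as integers.

    Assumption: every value is a whole number once separators are removed;
    pd.to_numeric raises a ValueError on anything it cannot parse.
    """
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned).astype("int64")

for col in ["Population", "Users"]:
    city_sample[col] = to_int(city_sample[col])

# Share of each city's population that uses cabs (hypothetical column name).
city_sample["Penetration"] = city_sample["Users"] / city_sample["Population"]

print(city_sample.dtypes)
print(city_sample)
```

Letting `pd.to_numeric` raise on unparseable values is deliberate in this sketch: a silent coercion to NaN would hide exactly the kind of data problem the profiling step is meant to surface.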

#### Customer Dataset

In [38]:
print("CUSTOMER DATA INFO")
customer_data.info()  # info() prints its report and returns None, so there is nothing to assign
print("\nCUSTOMER DATA HEAD")
customer_data_head = customer_data.head()
display(customer_data_head)
print("\nCUSTOMER DATA DESCRIBE")
customer_data_describe = customer_data.describe()
display(customer_data_describe)

CUSTOMER DATA INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49171 entries, 0 to 49170
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Customer ID         49171 non-null  int64 
 1   Gender              49171 non-null  object
 2   Age                 49171 non-null  int64 
 3   Income (USD/Month)  49171 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ MB

CUSTOMER DATA HEAD


| | Customer ID | Gender | Age | Income (USD/Month) |
|---|---|---|---|---|
| 0 | 29290 | Male | 28 | 10813 |
| 1 | 27703 | Male | 27 | 9237 |
| 2 | 28712 | Male | 53 | 11242 |
| 3 | 28020 | Male | 23 | 23327 |
| 4 | 27182 | Male | 33 | 8536 |



CUSTOMER DATA DESCRIBE


| | Customer ID | Age | Income (USD/Month) |
|---|---|---|---|
| count | 49171.0 | 49171.0 | 49171.0 |
| mean | 28398.252283 | 35.363121 | 15015.631856 |
| std | 17714.137333 | 12.599066 | 8002.208253 |
| min | 1.0 | 18.0 | 2000.0 |
| 25% | 12654.5 | 25.0 | 8289.5 |
| 50% | 27631.0 | 33.0 | 14656.0 |
| 75% | 43284.5 | 42.0 | 21035.0 |
| max | 60000.0 | 65.0 | 35000.0 |


High-level Observations and Insights On Customer Dataset
- Income is in USD/Month rather than a yearly value
- Income joined with the transaction table (on Customer ID) can give insight into which income groups use cabs the most
- Gender and Age can be combined with the other datasets to
    - See the gender distribution of both companies
    - See the age distribution of both companies
- No null values found
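
The joins described above could be sketched as follows, using the five head rows from each table (their Transaction ID and Customer ID values line up). The income bands and their cut-offs are hypothetical, chosen only to illustrate the grouping, and `rides` is an illustrative name.

```python
import pandas as pd

# Rows copied from the head outputs of the three tables.
cab_sample = pd.DataFrame({
    "Transaction ID": [10000011, 10000012, 10000013, 10000014, 10000015],
    "Company": ["Pink Cab"] * 5,
})
transaction_sample = pd.DataFrame({
    "Transaction ID": [10000011, 10000012, 10000013, 10000014, 10000015],
    "Customer ID": [29290, 27703, 28712, 28020, 27182],
})
customer_sample = pd.DataFrame({
    "Customer ID": [29290, 27703, 28712, 28020, 27182],
    "Gender": ["Male"] * 5,
    "Age": [28, 27, 53, 23, 33],
    "Income (USD/Month)": [10813, 9237, 11242, 23327, 8536],
})

# One row per ride: cab trip -> transaction -> customer attributes.
rides = (
    cab_sample
    .merge(transaction_sample, on="Transaction ID", how="inner")
    .merge(customer_sample, on="Customer ID", how="left")
)

# Gender share and age summary per company.
gender_share = rides.groupby("Company")["Gender"].value_counts(normalize=True)
age_summary = rides.groupby("Company")["Age"].describe()

# Hypothetical income bands (USD/Month); the cut-offs are illustrative only.
rides["Income Band"] = pd.cut(
    rides["Income (USD/Month)"],
    bins=[0, 10000, 20000, 35000],
    labels=["low", "mid", "high"],
)
rides_by_band = rides["Income Band"].value_counts()

print(gender_share)
print(age_summary)
print(rides_by_band)
```

The inner join on Transaction ID keeps only transactions that have a matching cab trip, which matters on the full data because the transaction table has more rows (440098) than the cab table (359392).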

#### Transaction Dataset

In [39]:
print("TRANSACTION DATA INFO")
transaction_data.info()  # info() prints its report and returns None, so there is nothing to assign
print("\nTRANSACTION DATA HEAD")
transaction_data_head = transaction_data.head()
display(transaction_data_head)
print("\nTRANSACTION DATA DESCRIBE")
transaction_data_describe = transaction_data.describe()
display(transaction_data_describe)

TRANSACTION DATA INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440098 entries, 0 to 440097
Data columns (total 3 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Transaction ID  440098 non-null  int64 
 1   Customer ID     440098 non-null  int64 
 2   Payment_Mode    440098 non-null  object
dtypes: int64(2), object(1)
memory usage: 10.1+ MB

TRANSACTION DATA HEAD


| | Transaction ID | Customer ID | Payment_Mode |
|---|---|---|---|
| 0 | 10000011 | 29290 | Card |
| 1 | 10000012 | 27703 | Card |
| 2 | 10000013 | 28712 | Cash |
| 3 | 10000014 | 28020 | Cash |
| 4 | 10000015 | 27182 | Card |



TRANSACTION DATA DESCRIBE


| | Transaction ID | Customer ID |
|---|---|---|
| count | 440098.0 | 440098.0 |
| mean | 10220060.0 | 23619.51312 |
| std | 127045.5 | 21195.549816 |
| min | 10000010.0 | 1.0 |
| 25% | 10110040.0 | 3530.0 |
| 50% | 10220060.0 | 15168.0 |
| 75% | 10330080.0 | 43884.0 |
| max | 10440110.0 | 60000.0 |


High-level Observations and Insights On Transaction Dataset
- Payment_Mode column can be used to find the preferred type of payment
    - Can be joined with the other datasets to check whether the preference varies with age, gender, etc.
- No null values found
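
A minimal sketch of the payment-preference analysis, again on the five head rows. `payment_share` and `payment_by_gender` are illustrative names, and the Gender values are taken from the matching customer rows shown earlier.

```python
import pandas as pd

# Rows copied from the transaction and customer head outputs.
transaction_sample = pd.DataFrame({
    "Transaction ID": [10000011, 10000012, 10000013, 10000014, 10000015],
    "Customer ID": [29290, 27703, 28712, 28020, 27182],
    "Payment_Mode": ["Card", "Card", "Cash", "Cash", "Card"],
})
customer_sample = pd.DataFrame({
    "Customer ID": [29290, 27703, 28712, 28020, 27182],
    "Gender": ["Male"] * 5,
})

# Overall share of each payment mode.
payment_share = transaction_sample["Payment_Mode"].value_counts(normalize=True)

# Payment mode split within each gender (each row sums to 1).
merged = transaction_sample.merge(customer_sample, on="Customer ID", how="left")
payment_by_gender = pd.crosstab(
    merged["Gender"], merged["Payment_Mode"], normalize="index"
)

print(payment_share)
print(payment_by_gender)
```

The same crosstab pattern would work for age by first binning the Age column with `pd.cut`.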