# Joining Data with Pandas
## 1.0 Data merging basics
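### 1.1 Inner joins
An inner join returns only the rows whose key values appear in both tables. By default, .merge() in pandas performs an inner join (how="inner"), so there is no need to specify it explicitly. The suffixes argument renames columns that share a name in both tables but are not used as the join key.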

Inner join syntax:

df1_df2 = df1.merge(df2, on = "common_column", suffixes = ("_df1", "_df2"))
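The syntax above can be tried on two small tables. This is a minimal sketch: the toy data and the value column are made up for illustration and are not part of the datasets used later.

```python
import pandas as pd

# Hypothetical toy tables; only keys 2 and 3 appear in both.
df1 = pd.DataFrame({"common_column": [1, 2, 3], "value": ["a", "b", "c"]})
df2 = pd.DataFrame({"common_column": [2, 3, 4], "value": ["x", "y", "z"]})

# how="inner" is the default, so it is omitted here.
# Both tables have a "value" column, so the suffixes tell them apart.
df1_df2 = df1.merge(df2, on="common_column", suffixes=("_df1", "_df2"))

print(df1_df2)
```

Only the rows with keys 2 and 3 survive, and the overlapping column appears twice, as value_df1 and value_df2.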


In [1]:
# load libraries 
import pandas as pd

### 1.2 Relationships
1. One-to-one relationship - Every row in the left table is related to one and only one row in the right table.

2. One-to-many relationship - Every row in the left table is related to one or more rows in the right table.
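One way to check which relationship two tables actually have is the validate argument of .merge(), which raises an error when the keys do not match the stated relationship. The sketch below uses small made-up tables; the names left and right and their contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tables: account 10 has two owners, so the
# relationship from left to right is one-to-many.
left = pd.DataFrame({"account": [10, 20], "business": ["A", "B"]})
right = pd.DataFrame({
    "account": [10, 10, 20],
    "title": ["PRESIDENT", "SECRETARY", "PARTNER"],
})

# Passes: keys are unique in the left table only.
one_to_many = left.merge(right, on="account", validate="one_to_many")

# Fails: the right table repeats account 10, so it is not one-to-one.
try:
    left.merge(right, on="account", validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False

print(len(one_to_many), is_one_to_one)
```

In a one-to-many merge, the left row is repeated once for each matching right row, which is why the result has more rows than the left table.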


#### One-to-many merge
A business may have one or multiple owners. We have two DataFrames: licenses and biz_owners. Merge the two tables on the account column and find the most common business owner title.


In [2]:
# load the dataframes
licenses = pd.read_pickle("licenses.p")
biz_owners = pd.read_pickle("business_owners.p")

In [3]:
# view the first few rows
licenses.head()

Unnamed: 0,account,ward,aid,business,address,zip
0,307071,3,743.0,REGGIE'S BAR & GRILL,2105 S STATE ST,60616
1,10,10,829.0,HONEYBEERS,13200 S HOUSTON AVE,60633
2,10002,14,775.0,CELINA DELI,5089 S ARCHER AVE,60632
3,10005,12,,KRAFT FOODS NORTH AMERICA,2005 W 43RD ST,60609
4,10044,44,638.0,NEYBOUR'S TAVERN & GRILLE,3651 N SOUTHPORT AVE,60613


In [4]:
# view the first few rows
biz_owners.head()

Unnamed: 0,account,first_name,last_name,title
0,10,PEARL,SHERMAN,PRESIDENT
1,10,PEARL,SHERMAN,SECRETARY
2,10002,WALTER,MROZEK,PARTNER
3,10002,CELINA,BYRDAK,PARTNER
4,10005,IRENE,ROSENFELD,PRESIDENT


In [5]:
# Merge the licenses and biz_owners table on account
licenses_owners = licenses.merge(biz_owners, on="account")

# Group the results by title then count the number of accounts
counted_df = licenses_owners.groupby("title").agg({"account":"count"})

# Sort counted_df in descending order
sorted_df = counted_df.sort_values(by="account", ascending=False)

# Use .head() method to print the first few rows of sorted_df
sorted_df.head()


title,account
PRESIDENT,6259
SECRETARY,5205
SOLE PROPRIETOR,1658
OTHER,1200
VICE PRESIDENT,970
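The group, count, and sort steps can also be collapsed into a single value_counts() call, which counts the non-null values of a column and sorts them in descending order. The sketch below compares both approaches on a small made-up table that stands in for licenses_owners; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the merged licenses_owners table.
licenses_owners = pd.DataFrame({
    "account": [10, 10, 10002, 10002, 10005, 10044],
    "title": ["PRESIDENT", "SECRETARY", "PARTNER",
              "PARTNER", "PRESIDENT", "PRESIDENT"],
})

# Approach used above: group by title, count accounts, sort descending.
sorted_df = (licenses_owners.groupby("title")
             .agg({"account": "count"})
             .sort_values(by="account", ascending=False))

# Shortcut: count each title directly (sorted descending by default).
title_counts = licenses_owners["title"].value_counts()

print(sorted_df.head())
print(title_counts.head())
```

The two results agree here because every row has both an account and a title; with missing values in either column the counts could differ.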


#### Merging multiple dataframes

df1.merge(df2, on = "col").merge(df3, on = "col").merge(df4, on = "col")
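A chained merge can be sketched on three small tables. The data and the column names a, b, and c are invented for illustration. Each .merge() returns a new DataFrame, so the next call operates on the result of the previous one.

```python
import pandas as pd

# Hypothetical toy tables sharing the key column "col".
df1 = pd.DataFrame({"col": [1, 2, 3], "a": [10, 20, 30]})
df2 = pd.DataFrame({"col": [1, 2], "b": ["x", "y"]})
df3 = pd.DataFrame({"col": [2, 3], "c": [True, False]})

# Wrapping the chain in parentheses avoids backslash continuations.
merged = (df1.merge(df2, on="col")
             .merge(df3, on="col"))

print(merged)
```

Because each step is an inner join, only key 2, which is present in all three tables, remains in the result.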



Find the total number of rides provided to passengers passing through the Wilson station (station_name == 'Wilson') when riding Chicago's public transportation system on weekdays (day_type == 'Weekday') in July (month == 7). 

In [6]:
# load the dataFrames
cal = pd.read_pickle("cta_calendar.p")
ridership = pd.read_pickle("cta_ridership.p")
stations = pd.read_pickle("stations.p")

In [7]:
# view the data frames
cal.head()

Unnamed: 0,year,month,day,day_type
0,2019,1,1,Sunday/Holiday
1,2019,1,2,Weekday
2,2019,1,3,Weekday
3,2019,1,4,Weekday
4,2019,1,5,Saturday


In [8]:
ridership.head()

Unnamed: 0,station_id,year,month,day,rides
0,40010,2019,1,1,576
1,40010,2019,1,2,1457
2,40010,2019,1,3,1543
3,40010,2019,1,4,1621
4,40010,2019,1,5,719


In [9]:
stations.head()

Unnamed: 0,station_id,station_name,location
0,40010,Austin-Forest Park,"(41.870851, -87.776812)"
1,40020,Harlem-Lake,"(41.886848, -87.803176)"
2,40030,Pulaski-Lake,"(41.885412, -87.725404)"
3,40040,Quincy/Wells,"(41.878723, -87.63374)"
4,40050,Davis,"(42.04771, -87.683543)"


In [10]:
# Merge the ridership, cal, and stations tables
ridership_cal_stations = ridership.merge(cal, on=['year','month','day']) \
                                  .merge(stations, on='station_id')

# Create a boolean mask that selects July weekday rows for the Wilson station
filter_criteria = ((ridership_cal_stations['month'] == 7) 
                   & (ridership_cal_stations['day_type'] == 'Weekday') 
                   & (ridership_cal_stations['station_name'] == 'Wilson'))

# Use .loc and the filter to select for rides
print(ridership_cal_stations.loc[filter_criteria, 'rides'].sum())


140005
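The filtering step relies on combining boolean Series with &, where each comparison needs its own parentheses because & binds more tightly than ==. The sketch below shows the pattern on a small made-up table, alongside an equivalent .query() call; the rows and the name rides_df are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the merged ridership table.
rides_df = pd.DataFrame({
    "station_name": ["Wilson", "Wilson", "Wilson", "Davis"],
    "month": [7, 7, 8, 7],
    "day_type": ["Weekday", "Saturday", "Weekday", "Weekday"],
    "rides": [100, 50, 70, 30],
})

# Boolean mask: every condition must hold for a row to be kept.
mask = ((rides_df["month"] == 7)
        & (rides_df["day_type"] == "Weekday")
        & (rides_df["station_name"] == "Wilson"))
total_mask = rides_df.loc[mask, "rides"].sum()

# The same filter written as a query string.
total_query = rides_df.query(
    "month == 7 and day_type == 'Weekday' and station_name == 'Wilson'"
)["rides"].sum()

print(total_mask, total_query)
```

Only the first row satisfies all three conditions, so both totals are 100.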


#### One to many merge with multiple tables



Assume that you are looking to start a business in the city of Chicago. Your idea is a company that uses goats to mow lawns for other businesses, and you have to choose a location in the city for your goat farm. You need a ward with plenty of vacant land and relatively few businesses and people nearby, to avoid complaints about the smell. You will need to merge three tables to help you choose: the land_use table has the percentage of vacant land by city ward, the census table has population by ward, and the licenses table lists businesses by ward.

In [11]:
# load the dataframes
land_use = pd.read_pickle("land_use.p")
licenses = pd.read_pickle("licenses.p")
census = pd.read_pickle("census.p")


In [12]:
# view the dataframe columns
land_use.columns

Index(['ward', 'residential', 'commercial', 'industrial', 'vacant', 'other'], dtype='object')

In [13]:
# view the dataframe columns
licenses.columns

Index(['account', 'ward', 'aid', 'business', 'address', 'zip'], dtype='object')

In [14]:
# view the dataframe columns
census.columns

Index(['ward', 'pop_2000', 'pop_2010', 'change', 'address', 'zip'], dtype='object')

In [15]:
# Merge land_use and census and merge result with licenses including suffixes
land_cen_lic = land_use.merge(census, on='ward') \
                    .merge(licenses, on='ward', suffixes=('_cen','_lic'))


# Group by ward, pop_2010, and vacant, then count the # of accounts
pop_vac_lic = land_cen_lic.groupby(['ward','pop_2010','vacant'], 
                                   as_index=False).agg({'account':'count'})


# Sort pop_vac_lic and print the results
# The best candidates should appear at the top, so sort by:
# most vacant land  - vacant, descending (ascending=False)
# fewest businesses - account, ascending (ascending=True)
# lowest population - pop_2010, ascending (ascending=True)
sorted_pop_vac_lic = pop_vac_lic.sort_values(["vacant", "account", "pop_2010"], 
                                             ascending = [False, True, True])

# Print the top few rows of sorted_pop_vac_lic
print(sorted_pop_vac_lic.head())

   ward  pop_2010  vacant  account
47    7     51581      19       80
12   20     52372      15      123
1    10     51535      14      130
16   24     54909      13       98
7    16     51954      13      156
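The multi-column sort can be sketched in isolation. When sort_values() receives a list of columns and a matching list of ascending flags, later columns only break ties in earlier ones. The table below is a small made-up example, not the real ward data.

```python
import pandas as pd

# Hypothetical wards; wards 1 and 3 tie on vacant land.
wards = pd.DataFrame({
    "ward": [1, 2, 3, 4],
    "vacant": [13, 19, 13, 15],
    "account": [156, 80, 98, 123],
})

# Most vacant land first; among ties, fewest businesses first.
sorted_wards = wards.sort_values(["vacant", "account"],
                                 ascending=[False, True])

print(sorted_wards)
```

Wards 3 and 1 both have a vacant value of 13, so the account column decides their order and ward 3, with fewer businesses, comes first.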
