# Practice 4

In this exercise, you will practice aggregating and summarizing data with Pandas `groupby` and `pivot_table` and merging/joining datasets using Pandas `concat` and `merge`.

You can either print answers directly from your code or write them in the markdown cells below your code. Either way, make sure that your answers are visible and can be easily read in the final notebook you turn in.

## Part 1: Summarizing Data
In this part we will start by working with the `seaborn` `planets` dataset. [Seaborn](https://seaborn.pydata.org/index.html) is a library for data visualization in Python (already included in your Anaconda distribution) to which we will return soon. For now, we are just using it for easy access to the `planets` dataset, which contains information about 1,035 extrasolar planets that have been discovered by astronomers over the last several years. As an aside, extrasolar planet discovery is an excellent example of how data science can help fuel discovery in the sciences by automating the analysis of large quantities of data, in this case from telescopes. If you're interested (not necessary to complete this assignment), you can read more at https://exoplanets.nasa.gov, from which this dataset was originally drawn.

To begin, run the following code to import the dataset into the `planets` DataFrame and preview the first five rows.

In [2]:
# Run but do not modify this code
import seaborn as sns
import pandas as pd

planets = sns.load_dataset('planets')
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


### Question 1
Use Pandas `groupby` operations to answer the following.

1. If you run the code `planets.groupby("number").count()` you will see different values for the different columns. Why is that?
2. What are the two `method`s that account for the most discoveries? How many discoveries were made with those `method`s?
3. In which years were more than 100 discoveries made?
4. Which `method` has found the most distant exoplanets on average (i.e., the `distance` column), and what is that average distance?
5. Which `method` has found the single most distant exoplanet in the dataset, and what `distance` is that exoplanet?

In [67]:
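# 1: count() skips NaN, so columns with missing values report smaller counts
planets.groupby("number").count()
# 2: size() counts rows (one per discovery); summing `number` would add up system sizes
planets.groupby("method").size().sort_values(ascending=False)
# 3: keep only the years with more than 100 discoveries
planets.groupby("year").size().loc[lambda s: s > 100]
# 4: mean distance per method
planets.groupby("method")['distance'].mean()
# 5: maximum distance per method (the last expression is the one displayed)
planets.groupby("method")['distance'].max()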



method
Astrometry                         20.77
Eclipse Timing Variations         500.00
Imaging                           165.00
Microlensing                     7720.00
Orbital Brightness Modulation    1180.00
Pulsar Timing                    1200.00
Pulsation Timing Variations          NaN
Radial Velocity                   354.00
Transit                          8500.00
Transit Timing Variations        2119.00
Name: distance, dtype: float64

### Answer 1
1. `planets.groupby("number").count()` returns different values for each column because `count()` excludes missing (NaN) values, and some columns are missing more data than others.
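2. Radial Velocity and Transit. Counting one row per discovery gives 553 and 397 respectively; summing the `number` column gives 952 and 776, but that adds up the number of planets in each system rather than counting discoveries.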
3. 2010, 2011, 2012, 2013
4. Microlensing: 4144
5. Transit: 8500
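
To illustrate why the per-column counts in part 1 differ, here is a minimal sketch on a small made-up DataFrame (the values are illustrative assumptions, not taken from `planets`). It shows that `count()` skips missing values column by column, while `size()` counts every row in a group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one mass value is missing
toy = pd.DataFrame({
    "method": ["Transit", "Transit", "Imaging"],
    "mass": [1.2, np.nan, 3.4],
    "year": [2010, 2011, 2012],
})

# count() ignores NaN, so `mass` reports fewer entries than `year` for Transit
counts = toy.groupby("method").count()
# size() counts every row in each group, missing values or not
sizes = toy.groupby("method").size()

print(counts)
print(sizes)
```

This is also why `size()` is the safer choice when the question is "how many rows fall in each group".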

### Question 2
Next we will work with the `titanic` dataset, which contains historical information about the passengers of the *Titanic*, the ocean liner that sank in the North Atlantic in 1912. Import the dataset and preview the first few rows below.

In [4]:
# Run but do not modify this code
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Use Pandas `pivot_table`s to answer the following. Note: What you choose to be the rows or columns of the `pivot_table`s is up to you as long as you are showing the correct groupings and values. We do recommend doing each part in its own cell, for nicer table formatting.

1. Show the average fare paid by passengers grouped by each combination of `sex` and `class`.
2. Show the number of passengers grouped by each combination of `class` and `embark_town`.
3. Show the fraction of passengers who survived (i.e., `survived==1`) grouped by each combination of `sex`, `class`, and `embark_town`. For example, if there are four individuals of a given `sex`, `class`, and `embark_town`, and three of the four survived, the value for that combination would be `0.75`.
4. Show the average `age` and the total `fare` paid by passengers grouped by each combination of `class` and `sex`. For example, if there were just 2 passengers with `class==first` and `sex==male` aged `20` and `30` and having paid `fare`s of `50` and `70` each, then the average age of that combination would be `25` and the total `fare` paid would be `120`.
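
The worked numbers in parts 3 and 4 can be checked with a small sketch. The two-row table below is a hypothetical stand-in built from the example in part 4 (ages `20` and `30`, fares `50` and `70`); the `survived` values are an added assumption for illustration. It shows that a dict passed to `aggfunc` applies a different aggregation to each column, and that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Hypothetical two-passenger table matching the worked example in part 4
toy = pd.DataFrame({
    "class": ["First", "First"],
    "sex": ["male", "male"],
    "age": [20, 30],
    "fare": [50, 70],
    "survived": [1, 0],  # assumed values, for illustration only
})

# A dict for aggfunc applies a different aggregation to each column
summary = toy.pivot_table(index="class", columns="sex",
                          aggfunc={"age": "mean", "fare": "sum"})

# The mean of a 0/1 column is the fraction of rows equal to 1
survival = toy.pivot_table(values="survived", index="class", columns="sex",
                           aggfunc="mean")

print(summary)
print(survival)
```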

In [69]:
# 1: average fare by sex and class (the default aggfunc is the mean)
print(titanic.pivot_table('fare', index='sex', columns='class'))
print("")
# 2: number of passengers by embark_town and class
print(titanic.pivot_table(index='embark_town', columns='class', aggfunc='size'))
print("")
# 3: survival fraction, i.e. the mean of the 0/1 `survived` column
print(titanic.pivot_table('survived', index=['sex', 'class'], columns='embark_town'))
print("")
# 4: average age and total fare by class and sex
print(titanic.pivot_table(index='class', columns='sex',
                    aggfunc={'fare': 'sum', 'age': 'mean'}))


class        First     Second      Third
sex                                     
female  106.125798  21.970121  16.118810
male     67.226127  19.741782  12.661633

class        First  Second  Third
embark_town                      
Cherbourg       85      17     66
Queenstown       2       3     72
Southampton    127     164    353

embark_town    Cherbourg  Queenstown  Southampton
sex    class                                     
female First    0.976744    1.000000     0.958333
       Second   1.000000    1.000000     0.910448
       Third    0.652174    0.727273     0.375000
male   First    0.404762    0.000000     0.354430
       Second   0.200000    0.000000     0.154639
       Third    0.232558    0.076923     0.128302

              age                  fare           
sex        female       male     female       male
class                                             
First   34.611765  41.281386  9975.8250  8201.5875
Second  28.722973  30.740707  1669.7292  2132.1125
Third   

## Part 2: Merging Data

We begin by studying four tips files included with this practice: `tips_Thur.csv`, `tips_Fri.csv`, `tips_Sat.csv`, and `tips_Sun.csv`. Each contains information about tips received by servers at a restaurant on the particular days of the week denoted by the file names (Thur for Thursday, Fri for Friday, Sat for Saturday, and Sun for Sunday). Below, we import and preview one of the datasets.

In [6]:
# Run but do not modify this code
import pandas as pd
Thur = pd.read_csv("tips_Thur.csv")
Thur.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size
0,27.2,4.0,Male,No,Lunch,4
1,22.76,3.0,Male,No,Lunch,2
2,17.29,2.71,Male,No,Lunch,2
3,19.44,3.0,Male,Yes,Lunch,2
4,16.66,3.4,Male,No,Lunch,2


### Question 3
Answer the following questions using the four tips datasets. You will need to combine (using Pandas `concat`) the datasets to answer some of the questions. Furthermore, some of the questions will require information about the day, which is only contained in the file names (you are welcome to add additional columns to the datasets if you wish).  

1. What is the average overall `total_bill` across all four days?
2. For each of the four days, what is the average `tip` for that day?
3. Create a pivot table that shows the average ratio of `tip` to `total_bill` (e.g., if a `tip` is `4` and the `total_bill` is `20`, then the ratio would be `0.2`) grouped by `sex` and day. 
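
As a rough sketch of the approach these questions call for, the example below uses two small made-up DataFrames in place of the per-day CSV files (the values are assumptions, chosen so that one row reproduces the `4 / 20 = 0.2` example). It adds the day label before concatenating, then averages the row-wise ratio in a pivot table.

```python
import pandas as pd

# Hypothetical stand-ins for two of the per-day files
thur = pd.DataFrame({"total_bill": [20.0, 10.0], "tip": [4.0, 1.0],
                     "sex": ["Male", "Female"]})
fri = pd.DataFrame({"total_bill": [30.0], "tip": [3.0], "sex": ["Male"]})

# assign() adds the day label that otherwise lives only in the file name
combined = pd.concat(
    [thur.assign(day="Thur"), fri.assign(day="Fri")],
    ignore_index=True,
)

# Row-wise ratio, e.g. 4 / 20 = 0.2
combined["ratio"] = combined["tip"] / combined["total_bill"]

# Average ratio for each combination of sex and day
table = combined.pivot_table(values="ratio", index="sex", columns="day")
print(table)
```

Note that this averages the per-row ratios, which is generally not the same as dividing the average `tip` by the average `total_bill`.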

In [43]:
Fri = pd.read_csv("tips_Fri.csv")
Sat = pd.read_csv("tips_Sat.csv")
Sun = pd.read_csv("tips_Sun.csv")

# Record the day, which is otherwise only available in the file name
Thur['day'] = 'Thur'
Fri['day'] = 'Fri'
Sat['day'] = 'Sat'
Sun['day'] = 'Sun'
fourdayarr = pd.concat([Thur, Sat, Fri, Sun])
# 1: overall average total_bill
print(fourdayarr['total_bill'].mean())
# 2: average tip for each day
print(fourdayarr.groupby('day')['tip'].mean())
# 3: average tip-to-bill ratio by sex and day
fourdayarr['ratio'] = fourdayarr['tip'] / fourdayarr['total_bill']
fourdayarr.pivot_table('ratio', index='sex', columns='day')


19.785942622950824
day
Fri     2.734737
Sat     2.993103
Sun     3.255132
Thur    2.771452
Name: tip, dtype: float64


day          Fri       Sat       Sun      Thur
sex                                           
Female  0.199388  0.156470  0.181569  0.157525
Male    0.143385  0.151577  0.162344  0.165276


### Answer 3
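1. The average `total_bill` across all four days is about 19.79.
2. The average `tip` is about 2.77 on Thursday, 2.73 on Friday, 2.99 on Saturday, and 3.26 on Sunday.
3. The pivot table above shows the average ratio of `tip` to `total_bill` by `sex` and day.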

### Question 4
In this question we will work with movie rating data in three different tables/dataframes. You will need to `merge` information from the different tables to answer the questions below. First we import and preview the tables.

In [44]:
# Run but do not modify this code
import pandas as pd
users = pd.read_csv("users.csv")
users.head()

Unnamed: 0,user_id,age,sex,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


In [45]:
# Run but do not modify this code
ratings = pd.read_csv("ratings.csv")
ratings.head()

Unnamed: 0,user_id,movie_id,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1


In [46]:
# Run but do not modify this code
movies = pd.read_csv("movies.csv")
movies.head()

Unnamed: 0,movie_id,movie_title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


Answer the following. 

1. How many movies have been rated at least 100 times?
2. Which five users have given the highest average ratings? List their `user_id`s and their average `rating`s.
3. Create a pivot table that displays average `rating`s grouped by `sex` and `occupation`.
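
One possible way to approach these questions is sketched below on miniature, made-up versions of the three tables (all values and the threshold of 2 ratings are assumptions for illustration). It chains two merges on the shared key columns, counts ratings per movie, and ranks users by average rating.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
users = pd.DataFrame({"user_id": [1, 2], "sex": ["M", "F"]})
movies = pd.DataFrame({"movie_id": [10, 20],
                       "movie_title": ["A (1995)", "B (1995)"]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [4, 2, 5],
})

# Naming the key with `on=` makes each join explicit
full = ratings.merge(movies, on="movie_id").merge(users, on="user_id")

# Number of ratings per movie, then keep movies with at least 2 ratings
per_movie = full.groupby("movie_id").size()
popular = per_movie[per_movie >= 2]

# User with the highest average rating
top_users = full.groupby("user_id")["rating"].mean().nlargest(1)

print(popular)
print(top_users)
```

Grouping by `movie_id` rather than `movie_title` keeps two distinct movies that happen to share a title from being counted together.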

In [140]:
# Put your code to answer the question here
#1
rated = pd.merge(ratings, movies, on='movie_id')
# Number of ratings per movie title
counts = rated.groupby('movie_title').size()
print((counts >= 100).sum())


338


In [141]:

#2
usersratings=pd.merge(rated,users)
topuser=(usersratings.groupby('user_id')['rating'].mean()).sort_values(ascending=False)
print(topuser)

user_id
849    4.869565
688    4.833333
507    4.724138
628    4.703704
928    4.687500
         ...   
774    2.058036
685    2.050000
445    1.985185
405    1.834464
181    1.491954
Name: rating, Length: 943, dtype: float64


In [142]:

#3
# Average rating by occupation and sex
usersratings.pivot_table(index='occupation', columns='sex',
                    aggfunc={'rating':'mean'})

                 rating          
sex                   F         M
occupation                       
administrator  3.781839  3.555233
artist         3.347065  3.875841
doctor              NaN  3.688889
educator       3.698857  3.660246
engineer       3.751724  3.537609
entertainment  3.448889  3.440107
executive      3.773756  3.319610
healthcare     2.736021  3.639839
homemaker      3.278810  3.500000
lawyer         3.623188  3.741379


### Answer 4
1. 338 movies have been rated at least 100 times.
2. The five users with the highest average ratings are `user_id` 849 (4.869565), 688 (4.833333), 507 (4.724138), 628 (4.703704), and 928 (4.687500).
3. The pivot table above shows the average `rating` by `occupation` and `sex`.


### Question 5
We will work with the `restaurants_a.csv` and `restaurants_b.csv` datasets for this question. Each contains five columns: `id` (a numeric index serving as a unique id, not correlated across the datasets), `name` (of the restaurant), `address` (the street address), `city`, and `type` (the type of restaurant). First we import and preview the data.

In [94]:
# Run but do not modify this code
df_a = pd.read_csv("restaurants_A.csv")
df_a.head()

Unnamed: 0,id,name,address,city,type
0,0,belvedere the,9882 little santa monica blvd.,beverly hills,pacific new wave
1,1,triangolo,345 e. 83rd st.,new york,italian
2,2,broadway deli,3rd st. promenade,santa monica,american
3,3,lettuce souprise you (at),3525 mall blvd.,duluth,cafeterias
4,4,otabe,68 e. 56th st.,new york,asian


In [95]:
# Run but do not modify this code
df_b = pd.read_csv("restaurants_B.csv")
df_b.head()

Unnamed: 0,id,name,address,city,type
0,22,indigo coastal grill,1397 n. highland ave.,atlanta,eclectic
1,54,aqua,252 california st.,san francisco,american (new)
2,89,boulevard,1 mission st.,san francisco,american (new)
3,150,khan toke thai house,5937 geary blvd.,san francisco,thai
4,151,bacchanalia,3125 piedmont rd. near peachtree rd.,atlanta,international


Some, but not all, of the restaurants in the two datasets are actually the same. In this question, we would like to consider the problem of merging the datasets. Unfortunately, the `id`s do not correspond between the datasets, so there is no obvious primary key to merge on. In this question, you will explore fuzzy matching to link the records between the two datasets. You will be asked to use the `edit_dist` function, but you do not need to implement it. An implementation is provided for you in `edit_distance.py`, and you can simply import the function below. It takes two strings as input and returns the edit distance between them.

In [96]:
# Run but do not modify this code
from edit_distance import edit_dist

# Example of using the edit_dist function
print(edit_dist("hello", "hallo!"))

2
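
The contents of `edit_distance.py` are not shown here. As a rough illustration only, the sketch below assumes `edit_dist` computes the standard Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions), which is consistent with the example output above. The function name `levenshtein` is hypothetical and is not the provided implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("hello", "hallo!"))  # one substitution plus one insertion
```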


Answer the following.

1. First, try to perform an inner merge (the default for Pandas `merge`) on the two datasets on the `name` column. How many rows are in the resulting merged dataset? Why is this value much smaller than the sizes of `df_a` and `df_b`?
2. Next, try to perform an inner merge on the two datasets on the `city` column. How many rows are in the resulting merged dataset? Why is this value much larger than the sizes of `df_a` and `df_b`?
3. Print the names of all pairs of records (one from `df_a` and the other from `df_b`) such that the two names have edit distance of 1 or 2 (note that if two strings have edit distance 0, they are exactly the same; you do not need to print these). It is fine to use `for` loops to solve this and your code may take a second or two to run.
4. Among the names you identified in step 3, which pairs do you think are actually misspellings, and which do you think might actually be different restaurants? Explain your answer using information from other columns besides `name`.

In [135]:
#1
print(len(df_a))
print(len(df_b))
# Inner merge keeps only names that appear in both datasets
print(len(pd.merge(df_a, df_b, on='name', how='inner')))
#2
print(len(df_a))
print(len(df_b))
# Every pair of rows that shares a city produces one merged row
print(len(pd.merge(df_a, df_b, on='city', how='inner')))

186
181
43
186
181
4887


In [128]:
# 3
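arr1 = df_a['name']
arr2 = df_b['name']
arr3 = []  # names from either dataset that take part in a close match
for i in arr1:
    for j in arr2:
        distance = edit_dist(i, j)
        # Distance 0 means identical names, so keep only distances 1 and 2
        if 0 < distance < 3:
            print(i + " " + j)
            arr3.append(i)
            arr3.append(j)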

l'orangerie l orangerie
indigo coast grill indigo coastal grill
boulavard boulevard
drago spago
felidia filidia
march marichu
mesa grill sea grill
uncle nicks uncle nick's


In [127]:
#4
lst2df=pd.DataFrame(arr3,columns=['name'])

print(pd.merge(df_a,lst2df))
print(pd.merge(df_b,lst2df))

    id                name                                   address  \
0   11         l'orangerie                   903 n. la cienega blvd.   
1   12  indigo coast grill                     1397 n. highland ave.   
2   62           boulavard                             1 mission st.   
3  124               drago                       2628 wilshire blvd.   
4  263             felidia                           243 e. 58th st.   
5  285               march                           405 e. 58th st.   
6  405          mesa grill                            102 fifth ave.   
7  561         uncle nicks  747 9th ave.  between 50th and 51st sts.   

            city              type  
0   w. hollywood  french (classic)  
1        atlanta         caribbean  
2  san francisco          american  
3   santa monica           italian  
4  new york city           italian  
5  new york city    american (new)  
6  new york city      southwestern  
7       new york     mediterranean  
    id            

### Answer 5
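1. The merge on `name` has 43 rows. It is much smaller than `df_a` (186 rows) and `df_b` (181 rows) because an inner merge keeps only the names that appear in both datasets, and only some of the names match exactly.
2. The merge on `city` has 4887 rows. Each city that appears in k rows of `df_a` and m rows of `df_b` contributes k * m rows to the result, and many restaurants share the same city in both datasets, so the total is much larger than either dataset.
3. The pairs with edit distance 1 or 2 are: l'orangerie / l orangerie; indigo coast grill / indigo coastal grill; boulavard / boulevard; drago / spago; felidia / filidia; march / marichu; mesa grill / sea grill; uncle nicks / uncle nick's.
4. mesa grill / sea grill, drago / spago, and march / marichu are probably different restaurants: the names do not look like common misspellings of each other, and the addresses of those restaurants are different. indigo coast grill / indigo coastal grill (1397 n. highland ave., atlanta) and boulavard / boulevard (1 mission st., san francisco) have the same address and city in both datasets, so they are almost certainly the same restaurant with a misspelled name. The remaining pairs (l'orangerie / l orangerie, felidia / filidia, uncle nicks / uncle nick's) differ only by punctuation or a single letter and look like spelling variants of the same name.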
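
The k * m argument in answer 2 can be checked on a small example. The sketch below uses two made-up tables (names and cities are assumptions for illustration) in which "new york" appears twice on the left and three times on the right, and compares the merged row count with the sum of the per-city products.

```python
import pandas as pd

# Hypothetical toy tables: "new york" appears 2 times on the left, 3 on the right
left = pd.DataFrame({"name_a": ["a1", "a2", "a3"],
                     "city": ["new york", "new york", "atlanta"]})
right = pd.DataFrame({"name_b": ["b1", "b2", "b3", "b4"],
                      "city": ["new york", "new york", "new york", "atlanta"]})

merged = left.merge(right, on="city", how="inner")

# Sum over shared cities of (left count * right count) = 2*3 + 1*1
expected = (left["city"].value_counts()
            * right["city"].value_counts()).dropna().sum()

print(len(merged), int(expected))
```

Both numbers agree, which is why merging on a non-unique column such as `city` pairs every restaurant with every other restaurant in the same city.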