# Merging and Combining data  

Sometimes we need to combine data from two or more dataframes.  That's colloquially known as a **merge** or a **join**. There are lots of ways to do this.  We do a couple but supply references to more at the end.  

Along the way we take an extended detour to review methods for **downloading** and **unzipping** compressed files.  The tools we use here have a broad range of other applications, including web scraping.  

Outline:  

* [MovieLens data](#movielens).  A collection of movies and individual ratings.  
* [Automate file download](#requests).  Use the requests package to get a zipped file, then other tools to unzip it and read in the contents.  
* [Merge movie names and ratings](#merge-movies).  Merge information from two dataframes with Pandas' `merge` function.  

**Note: requires internet access to run.**  

In [14]:
import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 

# these are new 
import requests, io             # internet and input tools  
import zipfile as zf            # zip file tools 
import shutil                   # file management tools 
import os                       # operating system tools (check files)

## MovieLens data 

The data comes as a zip file that contains several csv's.  We get the details from the README inside.  (It's written in Markdown, so it's easier to read if we use a browser to format it.  Or we could cut and paste into a Markdown cell in an IPython notebook.)  

The file descriptions are:  

* `ratings.csv`:  each line is an individual film rating with the rater and movie id's and the rating.  Order:  `userId, movieId, rating, timestamp`. 
* `tags.csv`:  each line is a tag on a specific film.  Order:  `userId, movieId, tag, timestamp`. 
* `movies.csv`:  each line is a movie name, its id, and its genre.  Order:  `movieId, title, genres`.  Multiple genres are separated by "pipes" `|`.   
* `links.csv`:  each line contains the movie id and corresponding id's at [IMDb](http://www.imdb.com/) and [TMDb](https://www.themoviedb.org/).  

The easy way to input this data is to download the zip file onto our computer, unzip it, and read the individual csv files using `read_csv()`.  But **anyone can do it the easy way**.  We want to automate this, so we can redo it without any manual steps.  This takes some effort, but once we have it down we can apply it to lots of other data sources.  

## Automate file download 

We're looking for an automated way, so that if we do this again, possibly with updated data, the whole process is in our code.  Automated data entry involves these steps: 

* Get the file.  We use the [requests](http://docs.python-requests.org/) package, which handles internet files and comes pre-installed with Anaconda. This kind of thing was hidden behind the scenes in the Pandas `read_csv` function, but here we need to do it for ourselves. The package authors add:  
>Recreational use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.
* Convert to zip.   Requests simply loads whatever's at the given url, which arrives as raw bytes. The [io](https://docs.python.org/3.5/library/io.html) module's `io.BytesIO` wraps those bytes in a file-like object, here a zip file.  
* Unzip the file.  We use the [zipfile](https://docs.python.org/3.5/library/zipfile.html) module, which is part of core Python, to extract the files inside.   
* Read in the csv's.  Now that we've extracted the csv files, we use `read_csv` as usual.  

We found this [Stack Overflow exchange](http://stackoverflow.com/questions/23419322/download-a-zip-file-and-extract-it-in-memory-using-python3) helpful. 
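
The last three steps can be tried without a network connection. The sketch below is only an illustration: it builds a tiny zip archive in memory as a stand-in for the bytes that `requests` would return, then wraps the bytes with `io.BytesIO`, opens them with `zipfile`, and reads the csv inside with `read_csv`. The archive name `toy/movies.csv` and the variable names are made up for this example.

```python
import io
import zipfile as zf
import pandas as pd

# Build a small zip archive in memory.  This is a hypothetical stand-in
# for the bytes we would get from a download (r.content in the notebook).
buffer = io.BytesIO()
with zf.ZipFile(buffer, mode='w') as archive:
    archive.writestr('toy/movies.csv',
                     'movieId,title\n1,Toy Story (1995)\n2,Jumanji (1995)\n')
raw_bytes = buffer.getvalue()

# Convert to zip: wrap the raw bytes in a file-like object and open it
toy_zip = zf.ZipFile(io.BytesIO(raw_bytes))
print('Files in archive:', toy_zip.namelist())

# Unzip and read: open one member and hand it to read_csv
with toy_zip.open('toy/movies.csv') as f:
    toy_movies = pd.read_csv(f)

print(toy_movies)
```

Nothing is written to disk here; the archive and its contents live in memory the whole time.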

**Digression.**  This is probably more than you want to know, but it's a reminder of what goes on behind the scenes when we apply `read_csv` to a url.  Here we grab whatever is at the url.  Then we get its contents as bytes, wrap them in a file-like object, open that as a zip file, and read its components using `read_csv`.  It's a lot easier when this happens automatically, but it helps to know what's involved if we ever have to look into the details.  

In [2]:
# get "response" from url 
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
r = requests.get(url) 

# describe response 
print('Response status code:', r.status_code)
print('Response type:', type(r))
print('Response .content:', type(r.content)) 
print('Response headers:\n', r.headers, sep='')

Response status code: 200
Response type: <class 'requests.models.Response'>
Response .content: <class 'bytes'>
Response headers:
{'ETag': '"e02fd-53f1536837040"', 'Accept-Ranges': 'bytes', 'Content-Type': 'application/zip', 'Last-Modified': 'Mon, 17 Oct 2016 20:13:45 GMT', 'Connection': 'Keep-Alive', 'Keep-Alive': 'timeout=5, max=100', 'Content-Length': '918269', 'Date': 'Mon, 13 Nov 2017 19:35:50 GMT', 'Server': 'Apache/2.4.18 (Ubuntu)'}


In [3]:
# convert bytes to zip file  
mlz = zf.ZipFile(io.BytesIO(r.content)) 
print('Type of zipfile object:', type(mlz))

Type of zipfile object: <class 'zipfile.ZipFile'>


In [4]:
# what's in the zip file?
mlz.namelist()

['ml-latest-small/',
 'ml-latest-small/links.csv',
 'ml-latest-small/movies.csv',
 'ml-latest-small/ratings.csv',
 'ml-latest-small/README.txt',
 'ml-latest-small/tags.csv']

In [5]:
# extract and read csv's
movies  = pd.read_csv(mlz.open(mlz.namelist()[2]))
ratings = pd.read_csv(mlz.open(mlz.namelist()[3]))
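
Picking files with `namelist()[2]` and `namelist()[3]` depends on the order of files inside the archive, which could change in a later release of the data. One possible alternative, sketched here with a hypothetical helper `find_member` that is not part of `zipfile`, is to look the file up by name instead. The list of names is copied from the output above.

```python
import posixpath

def find_member(names, filename):
    """Return the single archive member whose base name equals filename.

    Hypothetical helper for illustration; zip member names use forward slashes.
    """
    matches = [n for n in names if posixpath.basename(n) == filename]
    if len(matches) != 1:
        raise KeyError('expected one match for %r, found %d' % (filename, len(matches)))
    return matches[0]

# names as listed by mlz.namelist() above
names = ['ml-latest-small/',
         'ml-latest-small/links.csv',
         'ml-latest-small/movies.csv',
         'ml-latest-small/ratings.csv',
         'ml-latest-small/README.txt',
         'ml-latest-small/tags.csv']

movies_name = find_member(names, 'movies.csv')
ratings_name = find_member(names, 'ratings.csv')
print(movies_name)
print(ratings_name)
```

With the real archive, the result could be passed along as `pd.read_csv(mlz.open(find_member(mlz.namelist(), 'movies.csv')))`.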

In [11]:
# Whip through the dataframes and look what is going on...
for df in [movies, ratings]:
    print('Type:', type(df))
    print('Dimensions:', df.shape)
    print('Variables:', list(df))
    print('First few rows', df.head(3), '\n')

Type: <class 'pandas.core.frame.DataFrame'>
Dimensions: (9125, 3)
Variables: ['movieId', 'title', 'genres']
First few rows    movieId                    title  \
0        1         Toy Story (1995)   
1        2           Jumanji (1995)   
2        3  Grumpier Old Men (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance   

Type: <class 'pandas.core.frame.DataFrame'>
Dimensions: (100004, 4)
Variables: ['userId', 'movieId', 'rating', 'timestamp']
First few rows    userId  movieId  rating   timestamp
0       1       31     2.5  1260759144
1       1     1029     3.0  1260759179
2       1     1061     3.0  1260759182 



**Exercise.** Something to do together. Suppose we wanted to save the files on our computer.  How would we do it? Would we prefer individual csv's or a single zip?

## Merging ratings and movie titles 

The movie ratings in the dataframe `ratings` give us individual opinions about movies, but they don't include the name of the movie.   Why not?  Rather than include the name every time a movie is rated, the MovieLens data associates each rating with a movie code, then stores the names of movies associated with each movie code in the dataframe `movies`.  We run across this a lot:  some information is in one data table, other information is in another.  

Our **want** is therefore to add the movie name to the `ratings` dataframe.  We say we **merge** the two dataframes.  There are lots of ways to merge.  Here we do one as an illustration.  

Let's start by reminding ourselves what we have.  

### Merging

Here's roughly what's involved in what we're doing.  We take the `movieId` variable from `ratings` and look it up in `movies`.  When we find it, we look up the `title` and add it as a column in `ratings`.  The variable `movieId` is common, so we can use it to link the two dataframes.  
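
To see the lookup on a small scale, here is a toy sketch. The two dataframes and their contents are invented for illustration. One rating refers to a `movieId` that is missing from the movie table, which shows how a `'left'` merge differs from an `'inner'` one.

```python
import pandas as pd

# toy data, made up for this example
toy_ratings = pd.DataFrame({'userId':  [1, 1, 2],
                            'movieId': [10, 20, 99],   # 99 has no entry in toy_movies
                            'rating':  [4.0, 3.5, 5.0]})
toy_movies = pd.DataFrame({'movieId': [10, 20, 30],
                           'title':   ['Movie A', 'Movie B', 'Movie C']})

# left merge: keep every row of toy_ratings, look up the title where we can
left = pd.merge(toy_ratings, toy_movies, how='left', on='movieId')

# inner merge: keep only rows whose movieId appears in both dataframes
inner = pd.merge(toy_ratings, toy_movies, how='inner', on='movieId')

print(left)
print('Left merge shape:', left.shape)
print('Inner merge shape:', inner.shape)
```

The left merge keeps all three ratings and leaves the title missing (`NaN`) for the unmatched `movieId`. The inner merge drops that row.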

In [27]:
combo = pd.merge(ratings, movies,   # left and right df's
                 how='left',        # add to left 
                 on='movieId'       # link with this variable/column 
                ) 

print('Dimensions of ratings:', ratings.shape)
print('Dimensions of movies:', movies.shape)
print('Dimensions of new df:', combo.shape)

combo.head(50)

Dimensions of ratings: (100004, 4)
Dimensions of movies: (9125, 3)
Dimensions of new df: (100004, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,31,2.5,1260759144,Dangerous Minds (1995),Drama
1,1,1029,3.0,1260759179,Dumbo (1941),Animation|Children|Drama|Musical
2,1,1061,3.0,1260759182,Sleepers (1996),Thriller
3,1,1129,2.0,1260759185,Escape from New York (1981),Action|Adventure|Sci-Fi|Thriller
4,1,1172,4.0,1260759205,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama
5,1,1263,2.0,1260759151,"Deer Hunter, The (1978)",Drama|War
6,1,1287,2.0,1260759187,Ben-Hur (1959),Action|Adventure|Drama
7,1,1293,2.0,1260759148,Gandhi (1982),Drama
8,1,1339,3.5,1260759125,Dracula (Bram Stoker's Dracula) (1992),Fantasy|Horror|Romance|Thriller
9,1,1343,2.0,1260759131,Cape Fear (1991),Thriller
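
The merged dataframe has the same number of rows as `ratings`, which is what we expect from a left merge when each `movieId` appears only once in `movies`. A couple of optional checks can make that expectation explicit. The sketch below uses invented toy data and two options of `pd.merge` that are available in recent versions of Pandas: `validate` and `indicator`.

```python
import pandas as pd

# toy data, made up for this example
toy_ratings = pd.DataFrame({'userId':  [1, 1, 2],
                            'movieId': [10, 20, 99],
                            'rating':  [4.0, 3.5, 5.0]})
toy_movies = pd.DataFrame({'movieId': [10, 20, 30],
                           'title':   ['Movie A', 'Movie B', 'Movie C']})

checked = pd.merge(toy_ratings, toy_movies,
                   how='left', on='movieId',
                   validate='many_to_one',  # error if a movieId repeats in toy_movies
                   indicator=True)          # adds a _merge column: 'both' or 'left_only'

# count ratings that found no matching movie
n_unmatched = int((checked['_merge'] == 'left_only').sum())

print('Rows preserved:', len(checked) == len(toy_ratings))
print('Ratings without a title:', n_unmatched)
```

On the real data the same two arguments could be added to the `pd.merge(ratings, movies, ...)` call above.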


Now we can save this on our local computer for later work.  The `to_csv` line below is commented out; remove the `#` to write the file.  

In [15]:
# save as csv file for future use 
#combo.to_csv('mlcombined.csv')

print('Current directory:\n', os.getcwd(), sep='')
print('List of files:', os.listdir(), sep='\n')

Current directory:
C:\data_bootcamp\Data_Bootcamp_Fall_2017\data_bootcamp_1113
List of files:
['.ipynb_checkpoints', 'merge_iozip_data.ipynb', 'mlcombined.csv']
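
One detail to keep in mind when saving: by default `to_csv` writes the dataframe's index as an extra first column, and when we read the file back that column shows up with the name `Unnamed: 0`. The sketch below, which uses a made-up two-row dataframe and a temporary directory so that it leaves no files behind, compares the default with `index=False`.

```python
import os
import tempfile
import pandas as pd

# toy data, made up for this example
toy = pd.DataFrame({'userId': [1, 2],
                    'title':  ['Deer Hunter, The (1978)', 'Dumbo (1941)']})

with tempfile.TemporaryDirectory() as tmpdir:
    # default: the index is written as an unnamed first column
    path_with_index = os.path.join(tmpdir, 'toy_with_index.csv')
    toy.to_csv(path_with_index)
    reread_with_index = pd.read_csv(path_with_index)

    # index=False: only the data columns are written
    path_no_index = os.path.join(tmpdir, 'toy_no_index.csv')
    toy.to_csv(path_no_index, index=False)
    reread_no_index = pd.read_csv(path_no_index)

print(list(reread_with_index.columns))
print(list(reread_no_index.columns))
```

Titles that contain commas are quoted automatically when written, so they survive the round trip.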


**Exercise.** Some of these we know how to do, the others we don't.  For the ones we know, what is the answer?  For the others, what (in loose terms) do we need to be able to do to come up with an answer?  

* What is the overall average rating?  
* What is the overall distribution of ratings?  
* What is the average rating of each movie?  
* How many ratings does each movie get? 