## Data Wrangling with pandas

Data preparation is often the most important part of data analysis. Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for data processing. Fortunately, pandas, along with the Python standard library, provides a high-level, flexible, and high-performance set of core manipulations and algorithms that let you wrangle data into the right form without much trouble.


### MovieLens 1M Data Set

GroupLens Research provides a number of [collections of movie ratings data](http://grouplens.org/datasets/movielens/) collected from users of MovieLens. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation).

Download the [MovieLens 1M Data Set](http://files.grouplens.org/datasets/movielens/ml-1m.zip) (ml-1m.zip, 5.64MB). It contains more than 1M ratings collected from more than 6K users on about 4K movies (check exact numbers!). The data is spread across three tables: ratings, user information, and movie information.

### ZIP codes

Also download <a href="ftp://ftp.census.gov/econ2013/CBP_CSV/zbp13totals.zip">Complete ZIP Code Totals</a> (725KB) from the Census Bureau's website.

Create a folder called "movies", copy both downloaded files in, and unpack them.

In [2]:
import pandas as pd
import numpy as np
import os

pd.set_option('display.max_rows', 15)

## Files

Reading and writing files are two very common operations when working with data. Most often you will use the `read_csv` and `to_csv` functions, as you did in the introductory pandas lecture. These two functions work with different types of delimiter-separated files, which is the prevailing format for data files. At the end of the notebook we will also learn how to read and write Excel files.

### Reading .csv files

* text files in which values are comma-separated (csv = comma-separated values)
* standard format for storing tabular data
* values in a row are usually delimited with ',' (but sometimes with other characters, such as a tab)
* use quotes "" around a value when its text contains the delimiter
* methods [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [`to_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).
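
The points in the list above can be tried out on a small in-memory file. This is only a sketch: the two-row CSV text and the variable names (`csv_text`, `quoted`, `tabbed`) are made up for illustration and are not part of the MovieLens data.

```python
import io
import pandas as pd

# Hypothetical two-row file: the first title contains a comma, so it is quoted.
csv_text = (
    'movie_id,title,genres\n'
    '1,"Godfather, The (1972)",Action|Crime|Drama\n'
    '2,Toy Story (1995),Animation\n'
)

# The quoted comma is treated as text, not as a delimiter.
quoted = pd.read_csv(io.StringIO(csv_text))
print(quoted.shape)
print(quoted.loc[0, 'title'])

# Writing the frame back out: to_csv adds quotes again where they are needed.
print(quoted.to_csv(index=False))

# A tab-separated variant only needs a different `sep`; no quoting is required here.
tsv_text = 'movie_id\ttitle\n1\tGodfather, The (1972)\n'
tabbed = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(tabbed.loc[0, 'title'])
```

`io.StringIO` is used here only to avoid creating files on disk; `read_csv` accepts a path in exactly the same place.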

In [3]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
mnames = ['movie_id', 'title', 'genres']

users = pd.read_csv('movies/ml-1m/users.dat', sep='::', header=None, engine='python', encoding='latin1', names=unames)
ratings = pd.read_csv('movies/ml-1m/ratings.dat', sep='::', header=None, names=rnames, encoding='latin1', engine='python')
movies = pd.read_csv('movies/ml-1m/movies.dat', sep='::', header=None, names=mnames, encoding='latin1', engine='python')

The variables `users`, `ratings`, and `movies` are DataFrames. Remember, the main pandas structure is the DataFrame, which holds tabular (two-dimensional) data.
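
The `.dat` files use the two-character separator `::`, which is why the calls above pass `engine='python'`: pandas treats separators longer than one character as regular expressions, and those are handled by the Python parsing engine. A minimal sketch on made-up lines in the same layout (the sample text and the names `toy_users` and `toy_users_str` are illustrative assumptions):

```python
import io
import pandas as pd

# Two hypothetical lines in the same layout as users.dat.
sample = "1::F::1::10::48067\n4::M::45::7::02460\n"
cols = ['user_id', 'gender', 'age', 'occupation', 'zip']

# Multi-character separator -> parsed with the Python engine.
toy_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                        names=cols, engine='python')
print(toy_users['zip'].tolist())      # read as numbers, leading zero lost

# Optionally force 'zip' to be read as text so leading zeros survive.
toy_users_str = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                            names=cols, engine='python', dtype={'zip': str})
print(toy_users_str['zip'].tolist())
```

In the full `users.dat` file the column is read as text anyway, because some entries (for example ZIP+4 codes) cannot be parsed as integers.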

In [4]:
# Checking the dimensions of the data frames gives us an immediate answer about the number of users, ratings, and movies.
users.shape, ratings.shape, movies.shape

((6040, 5), (1000209, 4), (3883, 3))

We can verify that everything loaded correctly by looking at the first/last few rows of each data frame.

In [5]:
# show head
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [6]:
ratings.tail()

Unnamed: 0,user_id,movie_id,rating,timestamp
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648
1000208,6040,1097,4,956715569


In [7]:
movies.sample(n=6)

Unnamed: 0,movie_id,title,genres
2065,2134,Weird Science (1985),Comedy
940,952,Around the World in 80 Days (1956),Adventure|Comedy
1103,1119,Drunks (1997),Drama
3184,3253,Wayne's World (1992),Comedy
885,897,For Whom the Bell Tolls (1943),Adventure|War
1528,1568,MURDER and murder (1996),Crime|Drama|Mystery


In [8]:
# label-based slicing with .loc includes the end label, so this returns rows 0 through 7 (eight rows, like "movies.head(8)")
movies.loc[:7]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
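
The difference between label-based and position-based slicing is easy to check on a toy frame. A small sketch (the frame `toy` is made up; only the slicing behaviour matters):

```python
import pandas as pd

toy = pd.DataFrame({'movie_id': range(1, 11)})   # default index 0..9

by_label = toy.loc[:7]       # label slice: the end label 7 is included
by_position = toy.iloc[:7]   # positional slice: the end position is excluded
default_head = toy.head()    # head() returns 5 rows unless told otherwise

print(len(by_label), len(by_position), len(default_head))
```

With the default integer index, `toy.loc[:7]` therefore matches `toy.head(8)`, not `toy.head()`.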


### Merging data

It is much easier to work with all of the data merged together into a single table.

Data contained in pandas objects can be combined together in a number of built-in ways. These two are the most common:

- [pandas.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database *join* operations.
- [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) glues or stacks together objects along an axis.

Read more: [Merge, join, and concatenate](http://pandas.pydata.org/pandas-docs/stable/merging.html).

We will now combine the `users`, `ratings`, and `movies` dataframes. We have to use `merge`, because we are not just stacking the tables, but matching their rows on a common key column.
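
To make the distinction concrete, here is a minimal sketch on two made-up tables (`ratings_toy` and `users_toy` are illustrative stand-ins, not the real data):

```python
import pandas as pd

ratings_toy = pd.DataFrame({'user_id': [1, 1, 2], 'rating': [5, 3, 4]})
users_toy = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})

# merge: each rating row is matched with the user row that has the same key.
merged = pd.merge(ratings_toy, users_toy, on='user_id')
print(merged)

# concat: the tables are simply stacked; no keys are compared.
stacked = pd.concat([ratings_toy, ratings_toy], ignore_index=True)
print(stacked.shape)
```

The merge keeps one row per rating and adds the user's columns to it, which is the shape we want for the MovieLens tables.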

We first merge ratings with users... (the common key is *user_id*)

In [9]:
ratings.head(1)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760


In [10]:
users.head(1)

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067


In [11]:
# merge ratings with users
df = pd.merge(ratings, users, on='user_id')

In [12]:
df.sample(n=5)

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip
982150,5926,1304,4,957278154,F,50,7,65712
881177,5323,913,5,960847356,M,25,20,94122
523079,3224,86,4,968518439,F,25,14,93428
748369,4468,902,3,965067745,F,35,7,75601
65126,438,1425,4,976675122,M,18,11,53705


... and then we merge the resulting DataFrame with movies data... (the common key is *movie_id*)

In [13]:
movies.head(1)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy


In [14]:
# merge df with movies
df = pd.merge(df, movies, on='movie_id')

In [15]:
df.sample(n=5)

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
535566,3305,318,5,968002293,M,56,20,94708,"Shawshank Redemption, The (1994)",Drama
384417,2246,1263,4,974596866,M,25,0,60622,"Deer Hunter, The (1978)",Drama|War
837413,5035,3114,5,962552101,F,25,17,19810,Toy Story 2 (1999),Animation|Children's|Comedy
922829,5573,2132,4,959300561,F,35,1,14619,Who's Afraid of Virginia Woolf? (1966),Drama
784484,4682,858,4,964629646,M,25,7,5346,"Godfather, The (1972)",Action|Crime|Drama


### Data Transformation

Filtering, cleaning, and other transformations are important operations in data preparation. Removing duplicates, handling missing values, replacing values, and mapping values are examples of such operations.

---
The values in the column *age* are not as expected. It's time to read more about the data in the README file.

USERS FILE DESCRIPTION

Age is chosen from the following ranges:

     1:  "Under 18"
    18:  "18-24"
    25:  "25-34"
    35:  "35-44"
    45:  "45-49"
    50:  "50-55"
    56:  "56+"
---

In [16]:
# Check which values are present in the data frame.
df['age'].unique()

array([ 1, 56, 25, 45, 50, 35, 18])

In [17]:
# Create a Python dictionary (<key>-<value> pairs) in order to replace <keys> with <values>.
age_group = {
    1:  "Under 18",
    18:  "18-24",
    25:  "25-34",
    35:  "35-44",
    45:  "45-49",
    50:  "50-55",
    56:  "56+",
}

In [18]:
age_group

{1: 'Under 18',
 18: '18-24',
 25: '25-34',
 35: '35-44',
 45: '45-49',
 50: '50-55',
 56: '56+'}

In [19]:
df = df.replace({'age':age_group})
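
A small sketch of how a dictionary-based replacement behaves, compared with `Series.map`, which is another common way to apply a mapping. The toy frame and the code `99` are made up to show what happens to a value that is missing from the dictionary:

```python
import pandas as pd

codes = pd.DataFrame({'age': [1, 25, 56, 99]})   # 99 is a made-up code with no label
labels = {1: 'Under 18', 25: '25-34', 56: '56+'}

# replace: values found in the dictionary are swapped, others are left untouched.
replaced = codes.replace({'age': labels})['age']
print(replaced.tolist())

# map: values without an entry in the dictionary become NaN.
mapped = codes['age'].map(labels)
print(mapped.isna().sum())
```

Which behaviour is preferable depends on the data: `replace` silently keeps unknown codes, while `map` makes them visible as missing values.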

In [20]:
df.sample(n=5)

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
614009,3718,1208,3,966263923,M,50-55,17,7728,Apocalypse Now (1979),Drama|War
589061,3592,272,4,966645543,F,56+,2,99701,"Madness of King George, The (1994)",Drama
338498,1990,1250,5,974682514,M,25-34,7,85257,"Bridge on the River Kwai, The (1957)",Drama|War
62846,424,3667,3,1016724619,M,25-34,17,55112,Rent-A-Cop (1988),Action|Comedy
865597,5220,1327,4,961551533,M,25-34,7,91436,"Amityville Horror, The (1979)",Horror


The values in the column *occupation* are not as expected either. Let's see the README file once again.

USERS FILE DESCRIPTION

Occupation is chosen from the following choices:

     0:  "other" or not specified
     1:  "academic/educator"
     2:  "artist"
     3:  "clerical/admin"
     4:  "college/grad student"
     5:  "customer service"
     6:  "doctor/health care"
     7:  "executive/managerial"
     8:  "farmer"
     9:  "homemaker"
    10:  "K-12 student"
    11:  "lawyer"
    12:  "programmer"
    13:  "retired"
    14:  "sales/marketing"
    15:  "scientist"
    16:  "self-employed"
    17:  "technician/engineer"
    18:  "tradesman/craftsman"
    19:  "unemployed"
    20:  "writer"


In [21]:
# Check whether all these values are present in the data frame.
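
One possible way to approach this check, sketched on a made-up column rather than on `df` itself (the frame `toy` and the variable names are assumptions for illustration): compare the set of codes that occur in the data with the set of codes 0 to 20 listed in the README.

```python
import pandas as pd

toy = pd.DataFrame({'occupation': [10, 16, 15, 7, 20, 10]})
expected_codes = set(range(21))          # codes 0..20 from the README

present = set(toy['occupation'].unique().tolist())
missing_codes = sorted(expected_codes - present)      # listed but never used
unexpected_codes = sorted(present - expected_codes)   # used but not listed

print(missing_codes)
print(unexpected_codes)
```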

In [22]:
occupation = {
    0:  "other",
    1:  "academic/educator",
    2:  "artist",
    3:  "clerical/admin",
    4:  "college/grad student",
    5:  "customer service",
    6:  "doctor/health care",
    7:  "executive/managerial",
    8:  "farmer",
    9:  "homemaker",
    10:  "K-12 student",
    11:  "lawyer",
    12:  "programmer",
    13:  "retired",
    14:  "sales/marketing",
    15:  "scientist",
    16:  "self-employed",
    17:  "technician/engineer",
    18:  "tradesman/craftsman",
    19:  "unemployed",
    20:  "writer",
}

In [23]:
df = df.replace({'occupation':occupation})

In [24]:
df.sample(n=5)

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
767605,4574,745,5,973314480,M,25-34,college/grad student,60614,"Close Shave, A (1995)",Animation|Comedy|Thriller
947401,5720,2005,2,958578421,M,25-34,other,60610,"Goonies, The (1985)",Adventure|Children's|Fantasy
536216,3308,1009,2,967975269,F,18-24,writer,15701-1348,Escape to Witch Mountain (1975),Adventure|Children's|Fantasy
549530,3390,1955,4,967512147,F,45-49,writer,02476,Kramer Vs. Kramer (1979),Drama
947506,5722,1809,4,958502194,M,25-34,writer,48103,Hana-bi (1997),Comedy|Crime|Drama


In [25]:
df.sample(n=3).T

Unnamed: 0,303339,316086,513405
user_id,1802,1883,3167
movie_id,2901,1296,2997
rating,3,3,5
timestamp,974759310,974876148,968817145
gender,M,F,M
age,18-24,50-55,25-34
occupation,college/grad student,other,artist
zip,22903,94521,77056
title,Phantasm (1979),"Room with a View, A (1986)",Being John Malkovich (1999)
genres,Horror|Sci-Fi,Drama|Romance,Comedy


### ZIP codes

ZIP codes are useful, but we do not need that much detail: it would be more informative to know which US states the users come from. This information can be derived from the ZIP codes.

First check the type of 'zip' column in the current data frame: is it a number or a string?

In [26]:
df.dtypes

user_id        int64
movie_id       int64
rating         int64
timestamp      int64
gender        object
age           object
occupation    object
zip           object
title         object
genres        object
dtype: object

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   user_id     1000209 non-null  int64 
 1   movie_id    1000209 non-null  int64 
 2   rating      1000209 non-null  int64 
 3   timestamp   1000209 non-null  int64 
 4   gender      1000209 non-null  object
 5   age         1000209 non-null  object
 6   occupation  1000209 non-null  object
 7   zip         1000209 non-null  object
 8   title       1000209 non-null  object
 9   genres      1000209 non-null  object
dtypes: int64(4), object(6)
memory usage: 76.3+ MB


In [28]:
df.loc[df['zip'].str.len()!=5].sample(n=3).T

Unnamed: 0,541610,21922,170057
user_id,3333,161,1081
movie_id,2628,920,1556
rating,4,4,1
timestamp,967846796,977218485,974951045
gender,M,M,M
age,18-24,45-49,18-24
occupation,self-employed,self-employed,college/grad student
zip,29404-2205,98107-2117,68144-2410
title,Star Wars: Episode I - The Phantom Menace (1999),Gone with the Wind (1939),Speed 2: Cruise Control (1997)
genres,Action|Adventure|Fantasy|Sci-Fi,Drama|Romance|War,Action|Romance|Thriller


In [29]:
df.loc[df['zip'].str.len()!=5]['zip'].nunique()

79
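
The sampled rows above show ZIP+4 values such as `29404-2205`. The notebook does not normalize them, but if one wanted to, a possible approach is to keep only the first five characters. This is a sketch that assumes the non-standard values are ZIP+4 codes; the series `zips` is made up:

```python
import pandas as pd

zips = pd.Series(['48067', '15701-1348', '02476', '98107-2117'])

is_nonstandard = zips.str.len() != 5    # same test as in the cell above
zip5 = zips.str[:5]                     # keep the five-digit prefix

print(is_nonstandard.tolist())
print(zip5.tolist())
```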

Read the ZIP code totals CSV file into a DataFrame object.

In [30]:
#zip_path = os.path.expanduser('movies/zbp13totals.txt')
zip_path = 'movies/zbp13totals.txt'
zip_codes = pd.read_csv(zip_path)

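Alternative: tell `read_csv` to read the 'zip' column as strings, so that any leading zeros are preserved and the column has the same type as 'zip' in `df`.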

In [31]:
zip_codes = pd.read_csv(zip_path, converters={"zip":str})
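
The effect of the converter can be seen on a tiny in-memory file. This is only a sketch; the two-line text is made up and merely mimics the 'zip' and 'stabbr' columns:

```python
import io
import pandas as pd

text = "zip,stabbr\n00501,NY\n48067,MI\n"

# Default type inference turns the column into integers: 00501 becomes 501.
as_numbers = pd.read_csv(io.StringIO(text))
print(as_numbers['zip'].tolist())

# With a converter the values stay as text, exactly as written in the file.
as_strings = pd.read_csv(io.StringIO(text), converters={'zip': str})
print(as_strings['zip'].tolist())
```

Passing `dtype={'zip': str}` is an alternative spelling with the same result for this column.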

Both data frames have a common column 'zip', so we can merge them. The only column that we need from `zip_codes` is 'stabbr' (state abbreviation).

In [32]:
zip_codes.sample(n=5)

Unnamed: 0,zip,name,empflag,emp_nf,emp,qp1_nf,qp1,ap_nf,ap,est,city,stabbr,cty_name
28431,71426,"FISHER, LA",A,S,0,S,0,H,501,3,FISHER,LA,SABINE
31934,79084,"STRATFORD, TX",,H,411,S,0,H,21153,55,STRATFORD,TX,SHERMAN
37515,97216,"PORTLAND, OR",,G,8104,H,67873,H,298308,489,PORTLAND,OR,MULTNOMAH
10246,28076,"HENRIETTA, NC",C,S,0,S,0,H,1418,8,HENRIETTA,NC,RUTHERFORD
6972,19519,"EARLVILLE, PA",A,S,0,H,82,H,299,3,EARLVILLE,PA,BERKS


### More on merging data

The *how* argument to merge specifies how to determine which keys are to be included in the resulting table. Here is a summary of the how options and their SQL equivalent names:

Merge method | SQL Join Name    | Description
-------------|------------------|--------------
left         |LEFT OUTER JOIN   | Use keys from left frame only
right        |RIGHT OUTER JOIN  | Use keys from right frame only
outer        |FULL OUTER JOIN   | Use union of keys from both frames
inner        |INNER JOIN        | Use intersection of keys from both frames

Read more: [Merge, join, and concatenate](http://pandas.pydata.org/pandas-docs/stable/merging.html).
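
The four options in the table can be compared by counting the rows each one produces. A minimal sketch on two made-up tables that share two of their three keys:

```python
import pandas as pd

left = pd.DataFrame({'zip': ['48067', '96931', '55117'], 'user_id': [1, 53, 3]})
right = pd.DataFrame({'zip': ['48067', '55117', '10001'], 'stabbr': ['MI', 'MN', 'NY']})

# Number of rows in the result for each value of `how`.
sizes = {
    how: len(pd.merge(left, right, on='zip', how=how))
    for how in ['left', 'right', 'outer', 'inner']
}
print(sizes)
```

Keys that exist on only one side are kept by `left`, `right`, or `outer` with missing values in the other side's columns, and are dropped by `inner`.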

A nice visualization of the various types of joins (taken from Ravjot Singh's <a href="https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d">post</a> at medium.com): <img src="https://miro.medium.com/max/700/1*9eH1_7VbTZPZd9jBiGIyNA.png">

In [33]:
zip_codes.head(1)

Unnamed: 0,zip,name,empflag,emp_nf,emp,qp1_nf,qp1,ap_nf,ap,est,city,stabbr,cty_name
0,501,"HOLTSVILLE, NY",A,D,0,D,0,D,0,2,HOLTSVILLE,NY,SUFFOLK


In [34]:
df.head(1)

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,Under 18,K-12 student,48067,One Flew Over the Cuckoo's Nest (1975),Drama


In [35]:
# merge data with zip
df = pd.merge(df, zip_codes[['zip', 'stabbr']], on='zip', how='left')
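
When doing a left merge like this, it can be useful to see how many rows found no partner. A hedged sketch on made-up tables, using the `indicator` and `validate` parameters of `pd.merge` (the frames and names here are illustrative assumptions):

```python
import pandas as pd

ratings_toy = pd.DataFrame({'user_id': [1, 53, 53],
                            'zip': ['48067', '96931', '96931']})
zips_toy = pd.DataFrame({'zip': ['48067', '55117'], 'stabbr': ['MI', 'MN']})

checked = pd.merge(
    ratings_toy, zips_toy, on='zip', how='left',
    indicator=True,            # adds a '_merge' column: 'both' or 'left_only'
    validate='many_to_one',    # raises if a zip occurs twice in zips_toy
)
unmatched = (checked['_merge'] == 'left_only').sum()
print(len(checked), unmatched)
```

A left merge keeps every row of the left table, so unmatched keys show up as missing values in 'stabbr' rather than as lost rows.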

In [36]:
df.sample(n=3).T

Unnamed: 0,67984,211504,314038
user_id,456,1285,1875
movie_id,3070,1080,2490
rating,4,5,3
timestamp,976300995,976320009,974701077
gender,M,M,M
age,35-44,35-44,35-44
occupation,other,college/grad student,programmer
zip,55105,98125,94107
title,Adventures of Buckaroo Bonzai Across the 8th D...,Monty Python's Life of Brian (1979),Payback (1999)
genres,Adventure|Comedy|Sci-Fi,Comedy,Action|Thriller


In [37]:
df['zip'].isna().sum()

np.int64(0)

In [38]:
# Let's rename 'stabbr' to 'state'.
df = df.rename(columns={'stabbr':'state'})

In [39]:
df.sample(n=3).T

Unnamed: 0,526614,108961,713329
user_id,3259,713,4277
movie_id,2093,1136,3557
rating,2,4,4
timestamp,968271477,975535490,965387846
gender,F,M,M
age,18-24,35-44,35-44
occupation,college/grad student,executive/managerial,self-employed
zip,95616,79912,98133
title,Return to Oz (1985),Monty Python and the Holy Grail (1974),Jennifer 8 (1992)
genres,Adventure|Children's|Fantasy|Sci-Fi,Comedy,Thriller


### Handling Missing Data

Missing data is common in most data analysis applications. pandas represents missing values with the floating-point value `NaN` (`None` in object columns and `NaT` in datetime columns are treated as missing as well).

One of the goals of pandas was to make working with missing data as painless as possible. The following table briefly introduces the most common methods for this purpose.


Command      | Description
-------------|--------------
dropna       | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna       | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull       | Return like-type object containing boolean values indicating which values are missing.

Read more: [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
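
The three commands from the table, sketched on a made-up frame with one missing state (the frame `toy` and the fill value 'unknown' are illustrative assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'rating': [5, 3, 4], 'state': ['MI', np.nan, 'MN']})

mask = toy['state'].isna()                # True where the value is missing
filled = toy['state'].fillna('unknown')   # replace missing values with a constant
dropped = toy.dropna(subset=['state'])    # drop rows where 'state' is missing

print(mask.tolist())
print(filled.tolist())
print(dropped.index.tolist())
```

`isna` and `isnull` are two names for the same method. The `subset` argument restricts the check to the listed columns; without it, `dropna` removes a row if any column is missing.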

In our case, we will simply remove the rows whose ZIP code was not found in the Census table and which therefore have no state.

In [40]:
# Are there any missing values in state variable?
df['state'].isna().sum()

np.int64(27031)

In [41]:
# Display some of these rows.
df.loc[df['state'].isna()].head(3).T

Unnamed: 0,7379,7380,7381
user_id,53,53,53
movie_id,2987,1248,1175
rating,5,5,5
timestamp,977949703,977952986,977975954
gender,M,M,M
age,25-34,25-34,25-34
occupation,other,other,other
zip,96931,96931,96931
title,Who Framed Roger Rabbit? (1988),Touch of Evil (1958),Delicatessen (1991)
genres,Adventure|Animation|Film-Noir,Crime|Film-Noir|Thriller,Comedy|Sci-Fi


In [42]:
df.loc[df['state'].isna()].shape

(27031, 11)

In [43]:
# Drop the rows with missing values.

In [44]:
df = df.dropna()

The data frame now contains only rows whose ZIP code was matched to a state.

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 973178 entries, 0 to 1000208
Data columns (total 11 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user_id     973178 non-null  int64 
 1   movie_id    973178 non-null  int64 
 2   rating      973178 non-null  int64 
 3   timestamp   973178 non-null  int64 
 4   gender      973178 non-null  object
 5   age         973178 non-null  object
 6   occupation  973178 non-null  object
 7   zip         973178 non-null  object
 8   title       973178 non-null  object
 9   genres      973178 non-null  object
 10  state       973178 non-null  object
dtypes: int64(4), object(7)
memory usage: 89.1+ MB


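The index now has gaps, because dropping rows keeps the original row labels (the last label is still 1000208, although only 973178 rows remain). We reset it to get consecutive labels again.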

In [46]:
df.tail(3).T

Unnamed: 0,1000206,1000207,1000208
user_id,6040,6040,6040
movie_id,562,1096,1097
rating,5,4,4
timestamp,956704746,956715648,956715569
gender,M,M,M
age,25-34,25-34,25-34
occupation,doctor/health care,doctor/health care,doctor/health care
zip,11106,11106,11106
title,Welcome to the Dollhouse (1995),Sophie's Choice (1982),E.T. the Extra-Terrestrial (1982)
genres,Comedy|Drama,Drama,Children's|Drama|Fantasy|Sci-Fi


In [47]:
df = df.reset_index(drop=True)

The index is correct now.
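
What `reset_index` does, and why `drop=True` is passed, can be seen on a tiny made-up frame:

```python
import pandas as pd

# Dropping the row with label 1 leaves a gap in the index: 0, 2, 3.
toy = pd.DataFrame({'rating': [5, 3, 4, 2]}).drop(index=[1])
print(toy.index.tolist())

renumbered = toy.reset_index(drop=True)   # new labels 0, 1, 2; old labels discarded
kept = toy.reset_index()                  # old labels saved in a new 'index' column

print(renumbered.index.tolist())
print(kept.columns.tolist())
```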

<h3>Transforming Data Using a Function or Mapping</h3>

For many data sets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame.

In the present case, we want to make the following transformations:
<ul>
    <li>Convert epoch times in 'timestamp' column into a readable (and usable) format.</li>
    <li>Convert each letter in 'genres' column to lower case.</li>
</ul>

The function `pd.to_datetime` converts values into datetime objects. It accepts many arguments that control how dates are parsed; here we only need `unit='s'`, because the 'timestamp' column stores epoch times in seconds:

In [48]:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
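
A short sketch of why the `unit` argument matters, using the first two timestamps of user 1 (the series `stamps` is built by hand for illustration):

```python
import pandas as pd

stamps = pd.Series([978300760, 978302109])

# Interpreted as seconds since 1970-01-01 (the Unix epoch).
as_time = pd.to_datetime(stamps, unit='s')
print(as_time.tolist())

# Without `unit`, integers are interpreted as nanoseconds, which lands in 1970.
wrong = pd.to_datetime(stamps)
print(wrong.dt.year.tolist())

# Once converted, the .dt accessor exposes date parts such as the year.
years = as_time.dt.year
print(years.tolist())
```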

In [49]:
df.head(3).T

Unnamed: 0,0,1,2
user_id,1,1,1
movie_id,1193,661,914
rating,5,3,3
timestamp,2000-12-31 22:12:40,2000-12-31 22:35:09,2000-12-31 22:32:48
gender,F,F,F
age,Under 18,Under 18,Under 18
occupation,K-12 student,K-12 student,K-12 student
zip,48067,48067,48067
title,One Flew Over the Cuckoo's Nest (1975),James and the Giant Peach (1996),My Fair Lady (1964)
genres,Drama,Animation|Children's|Musical,Musical|Romance


There are many combinations of genres.

In [50]:
df['genres'].nunique()

301

Display only the first 15.

In [51]:
df['genres'].unique()[:15]

array(['Drama', "Animation|Children's|Musical", 'Musical|Romance',
       "Animation|Children's|Comedy", 'Action|Adventure|Comedy|Romance',
       'Action|Adventure|Drama', 'Comedy|Drama',
       "Adventure|Children's|Drama|Musical", 'Musical', 'Comedy',
       "Animation|Children's", 'Comedy|Fantasy', 'Animation',
       'Comedy|Sci-Fi', 'Drama|War'], dtype=object)

Convert each letter in 'genres' to lower case.

In [52]:
df['genres'] = df['genres'].str.lower()

In [53]:
df['genres'].unique()[:15]

array(['drama', "animation|children's|musical", 'musical|romance',
       "animation|children's|comedy", 'action|adventure|comedy|romance',
       'action|adventure|drama', 'comedy|drama',
       "adventure|children's|drama|musical", 'musical', 'comedy',
       "animation|children's", 'comedy|fantasy', 'animation',
       'comedy|sci-fi', 'drama|war'], dtype=object)
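
The `.str` accessor applies a string method to every value of a column. As a sketch of how this relates to transforming data with a function, the same result can be obtained with `map` and Python's `str.lower` (the three-element series is made up; the last line, which extracts the first listed genre, is just an illustration of another `.str` method):

```python
import pandas as pd

genres = pd.Series(['Drama', "Animation|Children's|Musical", 'Comedy|Sci-Fi'])

vectorized = genres.str.lower()      # string method applied to each value
mapped = genres.map(str.lower)       # same transformation via a plain function

# Other string methods are available in the same way, e.g. split on '|'.
first_genre = vectorized.str.split('|').str[0]

print(vectorized.tolist())
print(first_genre.tolist())
```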

### Basics of `groupby`

* Use groupby mechanics when you need to split the data along some dimension, do something with each split and then combine the results. 
* Examples:
    * aggregated statistics: group sums, mean, sizes, counts, etc.
    * group-specific transformations: standardization within groups, filling in missing values within groups with a value derived from each group, discarding groups according to some criterion (e.g. movies with fewer than 10 ratings), etc.
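
The three kinds of group operations mentioned in the list can be sketched on a made-up table of six ratings (the frame `toy` and the threshold of 2 ratings are assumptions chosen to keep the example small):

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'rating': [5, 4, 3, 2, 4, 1],
})

# Aggregation: one value per group.
means = toy.groupby('title')['rating'].mean()

# Transformation: one value per original row (here: rating minus the group mean).
centered = toy['rating'] - toy.groupby('title')['rating'].transform('mean')

# Filtering: keep only the groups that satisfy a condition (at least 2 ratings).
popular = toy.groupby('title').filter(lambda g: len(g) >= 2)

print(means.to_dict())
print(centered.tolist())
print(popular['title'].unique().tolist())
```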

We will demonstrate how to compute average rating for each occupation.

In [54]:
df.sample(n=3).T

Unnamed: 0,519513,699286,47243
user_id,3280,4303,329
movie_id,1208,534,1294
rating,5,4,3
timestamp,2000-09-05 15:50:18,2000-08-03 01:56:57,2000-12-26 15:43:33
gender,M,M,M
age,25-34,25-34,35-44
occupation,executive/managerial,artist,executive/managerial
zip,33486,91910,02115
title,Apocalypse Now (1979),Shadowlands (1993),M*A*S*H (1970)
genres,drama|war,drama|romance,comedy|war


In [55]:
df['rating'].mean()

np.float64(3.5835551153026475)

In [56]:
pd.set_option('display.max_rows', 21)

In [57]:
df.groupby('occupation')['rating'].mean()

occupation
K-12 student            3.549702
academic/educator       3.579064
artist                  3.584917
clerical/admin          3.653369
college/grad student    3.535004
customer service        3.549551
doctor/health care      3.650431
executive/managerial    3.595429
farmer                  3.419720
homemaker               3.656589
lawyer                  3.642639
other                   3.527849
programmer              3.661420
retired                 3.775022
sales/marketing         3.615692
scientist               3.685560
self-employed           3.590268
technician/engineer     3.612175
tradesman/craftsman     3.532170
unemployed              3.476621
writer                  3.519832
Name: rating, dtype: float64
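
A group mean is easier to interpret next to the number of ratings behind it. A possible extension, sketched on made-up data (the frame `toy` is illustrative; the real call would use `df`):

```python
import pandas as pd

toy = pd.DataFrame({
    'occupation': ['writer', 'writer', 'farmer', 'retired', 'retired', 'retired'],
    'rating': [4, 3, 3, 5, 4, 3],
})

# Several aggregations at once, sorted from the highest to the lowest mean.
summary = (
    toy.groupby('occupation')['rating']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```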

## Saving data

### Saving data to .csv

Our data is ready now. Let's store it in a CSV file. Note: the size of the file could be ~100MB.

In [58]:
# Save DataFrame into CSV file. 
df.to_csv('movie_lens_1M.csv', index=False)
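
A CSV file stores only text, so column types are inferred again when the file is read back. The sketch below uses a made-up two-row frame and an in-memory buffer instead of a real file to show what that means for 'zip' and 'timestamp':

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'zip': ['02115', '48067'],
    'timestamp': pd.to_datetime([978300760, 978302109], unit='s'),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)      # index=False: do not write the row labels

# Reading back with default settings: zip becomes a number, timestamp stays text.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive['zip'].tolist())

# Stating the types explicitly restores the original columns.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'zip': str}, parse_dates=['timestamp'])
print(restored['zip'].tolist())
print(restored['timestamp'].iloc[0])
```

The same `dtype` and `parse_dates` arguments can be passed when reading a file saved on disk.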