## Data Exploration with Pandas

In the previous session, we prepared our version of the [MovieLens](https://grouplens.org/datasets/movielens/) data set. The data now includes movie ratings, movie metadata, and demographic data about the users.

It is time to start exploring the data. We will do it in a *goal-oriented* manner: first finding the right questions, then looking for answers to them.

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [None]:
pd.set_option('display.max_rows', 15)
pd.set_option('display.precision', 2)
pd.options.display.float_format = '{:,.2f}'.format

<h3>Load and Summarize Dataset</h3>

Now it is time to load the file into a DataFrame and to take a quick look at the data.

In [None]:
df = pd.read_csv('movie_lens_1M.csv')

In [None]:
# Use pd.to_datetime to convert values in 'timestamp' column into datetime objects.
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S')

In [None]:
# This title is too long, let's shorten it.
df['title'] = df['title'].replace('Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)',
                                  'Seven Samurai (Shichinin no samurai) (1954)')

We are ready to start asking questions...

<mark>**Q1** What are the minimum, maximum, mean, and median ratings?</mark>

<h3>GroupBy Mechanics</h3>

By "group by" we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.

See more info in pandas documentation: [Group By: split-apply-combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)
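
The three steps can be seen on a tiny hand-made frame. This is only a sketch: the `toy` data and all `toy_*` names below are made up for illustration and are not part of the MovieLens file.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the MovieLens data.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B'],
    'rating': [4, 2, 5, 3, 4],
})

# Split: one group per distinct title.
toy_grouped = toy.groupby('title')

# Apply + combine: the mean rating of each group, collected into a Series
# indexed by title.
toy_means = toy_grouped['rating'].mean()

# size() counts the rows in each group.
toy_sizes = toy_grouped.size()

print(toy_means)
print(toy_sizes)
```

Nothing is computed when `groupby` is called; the work happens once an aggregation such as `mean()` or `size()` is requested.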

In [None]:
grouped = df.groupby('title')

In [None]:
movie_name, movie_data = None, None

for name, group_data in grouped:
    print("name: %s" % name)
    print("group shape: %s" % str(group_data.shape))
    movie_name = name
    movie_data = group_data
    break

In [None]:
movie_name

In [None]:
movie_data.head()

In [None]:
movie_data['title'].count()

In [None]:
gr = df.groupby('title')
gr.get_group("$1,000,000 Duck (1971)").head()

Let us group the data by the `title` column and save the number of ratings for each movie in a `ratings_by_title` object.

In [None]:
ratings_by_title = 

### Series

The data structure Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the *index*.

Read more: [Data Structures: Series](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)
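
As a small illustration, here is a Series built by hand. The values and the name `toy_counts` are invented for this sketch.

```python
import pandas as pd

# Hypothetical counts; the labels 'A', 'B', 'C' form the index of the Series.
toy_counts = pd.Series([3, 1, 2], index=['A', 'B', 'C'], name='n_ratings')

# Label-based access goes through the index.
a_count = toy_counts['A']

# Statistics and sorting work directly on the Series.
mean_count = toy_counts.mean()
largest_two = toy_counts.sort_values(ascending=False).head(2)

print(a_count, mean_count)
print(largest_two)
```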

<mark>**Q2** How many times were movies rated on average?</mark>

<mark>**Q3** Which are the Top 10 most rated movies? (Note: not best rated!)</mark>

### Indexing

We are now going to save the titles of all movies that were rated at least 250 times as an `Index` object. Later we will use this index for quick access to those movie titles.

Read more: [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
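
One possible way to select labels by a condition is a boolean mask. The sketch below uses invented counts and hypothetical names (`toy_counts`, `frequent`); it is not the only way to do this.

```python
import pandas as pd

# Hypothetical rating counts per title.
toy_counts = pd.Series([300, 12, 250, 7], index=['A', 'B', 'C', 'D'])

# Boolean mask: True where the count reaches the threshold.
mask = toy_counts >= 250

# Filtering with the mask and taking .index keeps only the matching labels.
frequent = toy_counts[mask].index

# An Index supports positional access and can be passed to .loc.
first_label = frequent[0]
subset = toy_counts.loc[frequent]

print(list(frequent))
print(subset)
```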

In [None]:
active_titles = 

In [None]:
active_titles[8]

<h3>Pivot Tables</h3>

A pivot table is a data summarization tool frequently found in spreadsheet programs and other analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.

Read more: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

![Pivot](https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_pivot.png)
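
To make the idea concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ratings with one row per (user, movie) rating.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'rating': [5, 3, 4, 2, 4],
})

# Rows are titles, columns are genders, cells hold the mean rating.
toy_pivot = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Sorting by one column ranks the rows by that group's mean.
toy_sorted = toy_pivot.sort_values(by='F', ascending=False)

print(toy_pivot)
```

Cells with no matching rows would appear as `NaN`; the toy data has none.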

<mark>**Q4** What are the Top 10 (best rated) movies according to females?</mark>

In [None]:
mean_ratings = df.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
mean_ratings[:5]

We are interested only in active titles.

In [None]:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings.head()

In [None]:
top_female_ratings = 

<mark>**Q5** What are the Top 10 (best rated) movies according to males?</mark>

In [None]:
top_male_ratings = 

<h3>Cross Tabulations: Crosstab</h3>

A cross-tabulation is a special case of a pivot table that computes group frequencies.

Use the crosstab function to compute a cross-tabulation of two (or more) factors. By default crosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.

Read more: [Cross tabulations](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#cross-tabulations)
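
A small sketch of a frequency table, with and without totals. The age labels and the `toy` frame are invented and may not match the encoding used in the real file.

```python
import pandas as pd

# Hypothetical users' ratings with made-up age group labels.
toy = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'F', 'M'],
    'age':    ['18-24', '18-24', '25-34', '25-34', '25-34'],
})

# Plain frequency table: how many rows fall into each (gender, age) cell.
toy_freq = pd.crosstab(toy['gender'], toy['age'])

# margins=True appends an 'All' row and column holding the subtotals.
toy_with_totals = pd.crosstab(toy['gender'], toy['age'], margins=True)

print(toy_with_totals)
```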

In [None]:
pd.crosstab(df.gender, df.age)

<mark>**Q6** Which age groups are best represented?</mark>

*Visualize results in order to obtain a clear answer.*

In [None]:
# Add subtotals


In [None]:
# It is always a good idea to visualize data when possible


<mark>**Q7** Which occupations are best represented?</mark>

*Visualize results in order to obtain a clear answer.*

In [None]:
pd.set_option('display.max_rows', 23)
movies_crosstab = 

In [None]:
pd.set_option('display.max_rows', 15)

### Stacking and Unstacking

Closely related to the pivot function are the stack and unstack functions, which are designed to work together with MultiIndex objects. The clearest way to explain what they do is by example.

Read more: [Reshaping by stacking and unstacking](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-by-stacking-and-unstacking)

In [None]:
# The data structure of movies_crosstab is actually a DataFrame...
type(movies_crosstab)

In [None]:
# However, not an ordinary one...
movies_crosstab.columns

In [None]:
# Create a subset to make the data easier to work with.
movies_crosstab_education = pd.crosstab([df.gender, df.age], df.occupation[df.occupation == 'academic/educator']).T
movies_crosstab_education

![Stack](https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_stack.png)

#### Stack

"Pivot" a level of the column labels, returning a DataFrame with an index with a new inner-most level of row labels.
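
A minimal sketch of `stack` on a hand-made frame (`toy_wide` is a hypothetical name and the values are invented). Because this frame has only one level of column labels, the result is a Series with a two-level index rather than a DataFrame.

```python
import pandas as pd

# Hypothetical wide table: one row per title, one column per gender.
toy_wide = pd.DataFrame(
    {'F': [4.0, 2.0], 'M': [3.5, 4.0]},
    index=pd.Index(['A', 'B'], name='title'),
)
toy_wide.columns.name = 'gender'

# stack() moves the column labels into a new innermost index level.
toy_stacked = toy_wide.stack()

print(toy_stacked)
```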


In [None]:
stacked = 

![Unstack](https://pandas.pydata.org/pandas-docs/stable/_images/reshaping_unstack.png)

#### Unstack

Inverse operation from stack: "pivot" a level of the row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
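
The following sketch shows `unstack` on a small Series with a two-level index. The names `toy_long` and `toy_by_title` and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format data indexed by (title, gender).
idx = pd.MultiIndex.from_tuples(
    [('A', 'F'), ('A', 'M'), ('B', 'F'), ('B', 'M')],
    names=['title', 'gender'],
)
toy_long = pd.Series([4.0, 3.5, 2.0, 4.0], index=idx)

# By default the innermost level ('gender') moves to the columns.
toy_wide = toy_long.unstack()

# A level can also be chosen by name: here 'title' becomes the columns.
toy_by_title = toy_long.unstack('title')

print(toy_wide)
print(toy_by_title)
```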

<mark>**Q8** What are the Top 10 (best rated) movies according to the age group 18-24?</mark>

<mark>**Q9** What are the Top 10 (best rated) movies according to females in the age group 18-24?</mark>

<mark>**Q10** What are the Top 10 (best rated) movies according to males in the age group 25-34?</mark>

In [None]:
top_young_male_ratings = 

<mark>**Q11** How many people have rated the movie called "Ferris Bueller's Day Off (1986)"?</mark>

<mark>**Q12** What are the Top 10 movies where the ratings of female and male users differ the most?</mark>

<mark>**Q13** What are the Top 10 movies where the ratings differ the most among all users?</mark>

*Hint: use standard deviations.*
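
To see why the standard deviation measures disagreement, consider this sketch on invented ratings: a movie everyone rates the same has a deviation of zero, while a movie with mixed ratings has a large one. All names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical ratings: 'A' divides opinion, 'B' is rated identically by all.
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'rating': [1, 3, 5, 4, 4, 4],
})

# Sample standard deviation of the ratings within each title.
toy_std = toy.groupby('title')['rating'].std()

# Largest deviation first: the most divisive titles come out on top.
most_divisive = toy_std.sort_values(ascending=False)

print(most_divisive)
```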

In [None]:
# Standard deviation of rating grouped by title
rating_std_by_title = 

In [None]:
# Filter down to active_titles
rating_std_by_title = 

<mark>**Q14** Which are the Top 10 states that the most user ratings came from?</mark>

In [None]:
# How many states are represented in the data?


<mark>**Q15** What are the ten most represented genres?</mark>

In [None]:
# How many different genres are there in the data?
