# Basic Data Processing with Pandas - Part 1

**What is Pandas for?**

This tool is essentially your data’s home. Through Pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

For example, Pandas can extract the data from a CSV file into a DataFrame (a table, basically), then lets you do things like:

- Calculate statistics and answer questions about the data, like:
    - What's the average, median, max, or min of each column?
    - Does column A correlate with column B?
    - What does the distribution of data in column C look like?
    
    
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria

- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.

- Store the cleaned, transformed data back into a CSV file, another file format, or a database

In this tutorial, we will work with a dataset with information about movies from IMDB. We will use Pandas to answer questions such as:
- What are the average ratings of movies by year?
- What genres are the most highly rated?
- What directors bring the most revenue to the studio?

# Getting Started

Pandas is not a built-in Python library, so we need to install it first. We will use the `pip` package manager to install Pandas. If you are using Colab, Pandas is already installed. If you are using your own computer, you can install Pandas by running the following command:

``` !pip install pandas ```

The ! tells the notebook to run the following command in the terminal. `pip` is the package manager that comes with Python. Usually, you can install a package by running `pip install <package_name>`.

To import pandas, we use the following command:

```import pandas as pd ```

The `pd` is a common alias for pandas. It is used to save typing and to distinguish pandas from other libraries that are also imported into the notebook.

## Basic Definitions

The two primary components of pandas are **Series** and **DataFrame**.

### Series

A `Series` is essentially a column, and a `DataFrame` is a two-dimensional table made up of a collection of `Series`:

`class pandas.Series(data=None, index=None...)`

**data:** Contains data stored in Series.\
**index:** With the `index` argument, you can name your own labels.



In [None]:
import pandas as pd

data = [2, 4, 5, 6, 9]

series = pd.Series(data, index=[1, 3, 5, 7, 9])

print(series)

The `index` is like the row labels of a spreadsheet: it holds the labels that identify each row. If you don't specify an index, pandas creates a default one with the integers 0, 1, 2, and so on. Any list of values can be used as an index, but it usually holds either integers or strings.

In [None]:
menu = pd.Series([8.5, 3.0, 10.0], index=["salad", "soup", "pizza"])
print(menu)
print(menu["soup"])
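
To make the difference between a default and a custom index concrete, here is a minimal sketch. The `prices` Series is a hypothetical name introduced only for this illustration; `menu` repeats the example above.

```python
import pandas as pd

# No index given: pandas assigns default integer labels 0, 1, 2, ...
prices = pd.Series([8.5, 3.0, 10.0])
print(prices.index.tolist())   # [0, 1, 2]

# Custom string labels, as in the menu example
menu = pd.Series([8.5, 3.0, 10.0], index=["salad", "soup", "pizza"])
print(menu.loc["soup"])   # lookup by label -> 3.0
print(menu.iloc[1])       # lookup by position (second entry) -> 3.0
```

Label-based and position-based lookups reach the same value here, but they answer different questions: "which entry is called soup?" versus "which entry comes second?".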

### DataFrame

A `DataFrame` is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds with a row (or record) and a column.


Suppose we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

In [None]:
data = {
    "apples": [3, 2, 0, 1], 
    "oranges": [0, 3, 7, 2]
}

print(data)

Now converting to a `DataFrame`:

In [None]:
purchases = pd.DataFrame(data)

print(purchases)

**How did that work?**

Each (key, value) item in data corresponds to a column in the resulting DataFrame.

`class pandas.DataFrame(data=None, index=None, columns=None, dtype=None...)`

**data:** Data can be ndarray (structured or homogeneous), Iterable, dict, or DataFrame.

**index:** Index to use for resulting frame.

**columns:** Column labels to use for resulting frame when data does not have them. If data contains column labels, will perform column selection instead.

**dtype:** Data type to force. Only a single dtype is allowed.
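
As a rough illustration of the `columns` and `dtype` arguments, the sketch below builds the same fruit purchases from a list of rows instead of a dictionary. The names `rows` and `purchases_from_rows` are made up for this example.

```python
import pandas as pd

# Each inner list is one row; the data carries no column labels of its own
rows = [[3, 0], [2, 3], [0, 7], [1, 2]]

purchases_from_rows = pd.DataFrame(
    rows,
    columns=["apples", "oranges"],  # label the columns explicitly
    dtype="float64",                # force every column to a single dtype
)

print(purchases_from_rows)
print(purchases_from_rows.dtypes)
```

Without `dtype`, pandas would infer integer columns from this data.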

Let's have customer names as our index:

In [None]:
purchases = pd.DataFrame(data, index=["June", "Robert", "Lily", "David"])

print(purchases)

So now we could locate a customer's order by using their name:



In [None]:
purchases.loc["June"]

### Task: Create a Dataframe with 3 columns and 3 rows. The columns should be named `Name`, `Age`, and `Favorite Color`. Fill in the rows with any data you like. The `Name` column should be the index.



In [None]:
# Task1 solution -- there are different ways of creating the same DataFrame. This is just one of them; other solutions are also correct.
data = [["Jim", 30, "green"], ["Pam", 27, "yellow"], ["Michael", 45, "red"]]
df = pd.DataFrame(data, columns=["Name", "Age", "Favorite Color"])
# The task asks for the Name column to be the index
df = df.set_index("Name")
df


# Loading DataFrames

Often, you won't be creating DataFrames from scratch. Instead, you will be loading them from files. Pandas can read a variety of file types using its `pd.read_` functions. CSV files are one of the most common, so we will start there.

## Uploading Data

Let's first download and import the dataset we will be working with. First, download the file `IMDB-Movie-Data.csv` from Canvas. 

Then, upload it to your Colab notebook by clicking on the folder icon on the left side of the screen. Click on the `Upload` button and select the file. You should see the file in the file explorer on the left side of the screen. If you right click on the file, you can copy its path by clicking on `Copy Path`.

You can also sync your Google Drive with Colab. To do this, click on the folder icon on the left side of the screen. Click on the `Mount Drive` button. You will be prompted to authenticate your Google account. Once you do that, you will be able to access your Google Drive files from Colab. You can upload the file to your Google Drive and then access it from Colab.

If you are using a local Jupyter notebook, you can just save the file in the same directory as your notebook.

## Loading CSV Files

CSV stands for "comma-separated values". CSV files are a common way to store tabular data. They are plain text files with a specific structure. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The first line of the file usually contains the names of each column.

With CSV files, all you need is a single line of code:
`pandas.read_csv(filepath)`

There are many other arguments you can pass to `read_csv`, but we will only use the `filepath` for now.

Use the help function to learn more about `read_csv`:

```help(pd.read_csv)``` or ```pd.read_csv?```

In [None]:
# The default value is index_col=None
movies_df = pd.read_csv("datasets/IMDB-Movie-Data.csv")

# If we set index_col=0, we're explicitly stating to treat the first column as the index:
movies_df = pd.read_csv("datasets/IMDB-Movie-Data.csv", index_col=0)

movies_df
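
To see what `index_col` changes without needing the movie file, here is a small sketch that reads CSV text from memory. The CSV content and the names `csv_text`, `default_df`, and `ranked_df` are hypothetical; only `pd.read_csv` and `io.StringIO` are real APIs.

```python
import io
import pandas as pd

# Hypothetical CSV: a header line followed by two records
csv_text = "Rank,Title,Year\n1,Movie A,2014\n2,Movie B,2012\n"

# Default: pandas adds its own 0, 1, 2, ... index and keeps Rank as a column
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (Rank) becomes the index
ranked_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())   # ['Rank', 'Title', 'Year']
print(ranked_df.columns.tolist())    # ['Title', 'Year']
print(ranked_df.index.name)          # Rank
```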

## Exporting CSV Files

You can export a DataFrame to a CSV file using the `DataFrame.to_csv` method. The method has the following signature:

`DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True...)`

Use the help function to learn more about `to_csv`:

```help(pd.DataFrame.to_csv)``` or ```pd.DataFrame.to_csv?```

In [None]:
movies_df.to_csv("new_file.csv")
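
The sketch below shows two `to_csv` details that are easy to miss, using a small stand-in frame rather than the movie data: calling it without a path returns the CSV text, and `index=False` leaves the row labels out. The variable names are assumptions made for this example.

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2], "oranges": [0, 3]},
    index=["June", "Robert"],
)

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = purchases.to_csv()
print(csv_text)

# index=False leaves the row labels out of the output
print(purchases.to_csv(index=False))

# Reading the text back with index_col=0 recovers the row labels
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

If you export with the default settings and later reload without `index_col=0`, the old index comes back as an extra unnamed column.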

# Exploring your DataFrame

Now let's learn some ways to explore your DataFrame. First, let's see some methods for checking the data within the DataFrame.

## Accessing Data

`DataFrame.head(n)` returns the first n rows of the DataFrame.

In [None]:
movies_df.head(5)

`DataFrame.tail(n)` returns the last n rows of the DataFrame.

In [None]:
movies_df.tail(5)

`DataFrame.sample(n)` returns a random sample of n rows from the DataFrame.

You can also use `DataFrame.sample(frac=f)` to return a random sample containing a fraction `f` of the rows (for example, `frac=0.05` for 5%).

In [None]:
# Run this cell a few times to see different samples
movies_df.sample(5)
# Now sample 5% of the DataFrame
movies_df.sample(frac=0.05)
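
Here is a small sketch of the two ways to size a sample, run on a hypothetical 20-row frame called `numbers_df` instead of the movie data. It also uses `random_state`, a real `sample` parameter that this tutorial does not otherwise cover, to make a draw repeatable.

```python
import pandas as pd

# Hypothetical 20-row frame standing in for movies_df
numbers_df = pd.DataFrame({"value": range(20)})

# n picks a fixed number of rows; frac picks a fraction of all rows
five_rows = numbers_df.sample(n=5)
quarter = numbers_df.sample(frac=0.25)   # 25% of 20 rows -> 5 rows
print(len(five_rows), len(quarter))      # 5 5

# random_state fixes the random draw, so both samples are identical
first = numbers_df.sample(n=5, random_state=42)
second = numbers_df.sample(n=5, random_state=42)
print(first.equals(second))              # True
```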

### Accessing Columns

You can access a specific column by using the following syntax:

In [None]:
movies_df["Title"]
# Or
movies_df.Title

The square brackets + string syntax works for any column name.

The `.` notation only works if the column name is a valid Python variable name. For instance, `df.Movie Title` will not work, but `df.movie_title` will. 

To make sure your code always works, you can use the square brackets + string syntax.
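
The following sketch shows two cases where dot notation falls short. The `films` frame and its column names are invented for illustration; the second case (a column that shares its name with a DataFrame method) is an additional pitfall not mentioned above.

```python
import pandas as pd

# Hypothetical frame: one column name has a space, another matches a method name
films = pd.DataFrame({
    "Movie Title": ["A", "B"],
    "count": [10, 20],
    "rating": [7.1, 8.3],
})

print(films["Movie Title"].tolist())   # brackets handle the space
print(films.rating.tolist())           # dot notation works for a plain name
print(films["count"].tolist())         # brackets return the column
print(callable(films.count))           # True: films.count is the DataFrame method
```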

### Accessing Rows

You can access a specific row in two ways:

- `df.loc` - locates by index name (row label). In our case, the index is the ranking of the movie.
- `df.iloc` - locates by numerical index (row number).

Note that `df.loc` and `df.iloc` are not methods, but attributes. This means that you don't use parentheses to call them. You just use them like this: `df.loc[1]` or `df.iloc[1]`.

The arguments for `df.loc` can also be a list of indices. For example, `df.loc[[1, 2, 3]]` will return the top three ranked movies.

Remember that DataFrame indices can also be strings.

In [None]:
# Returns the movie with Rank 1
print(movies_df.loc[1])
# Returns the movie in the first row
print(movies_df.iloc[0])

`df.loc` and `df.iloc` also work for accessing columns. For example, `df.loc[1, 'Title']` will return the title of the movie with index 1.

You can access all rows for a specific column by using `:` as the first argument. For example, `df.loc[:, 'Title']` will return all the titles.
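
The difference between labels and positions is easiest to see on a tiny table. This is a minimal sketch with a hypothetical `ranking` frame whose index starts at 1, mimicking the Rank index of the movie data.

```python
import pandas as pd

# Hypothetical mini ranking whose index starts at 1
ranking = pd.DataFrame(
    {"Title": ["Movie A", "Movie B", "Movie C"], "Rating": [8.1, 7.9, 7.0]},
    index=[1, 2, 3],
)

print(ranking.loc[1, "Title"])   # label 1 -> Movie A
print(ranking.iloc[1, 0])        # position 1 (second row), column 0 -> Movie B

# Label slices include the end label; position slices exclude the end position
print(len(ranking.loc[1:2]))     # 2 rows (labels 1 and 2)
print(len(ranking.iloc[1:2]))    # 1 row (position 1 only)
```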

### Task: trying out accessing methods

- Change the index of your DataFrame to the column `Title`. Then, use `df.loc` to find out the rating of the movie `The Dark Knight Rises`.
- Does slicing work with DataFrames? Try to access rows 10 to 20 using slicing.
- Does slicing work with indices that are strings? Try to slice the `Title` index.
- Use slicing to access all rows for the columns `Title` and `Rating`.
- Use slicing to access all rows in reverse order.

In [None]:
# 1. 
movies_df.set_index("Title").loc["The Dark Knight Rises", "Rating"]

# 2.
# Yes: plain slicing works by position, so this returns positions 10 to 19 (the end is excluded)
movies_df[10:20]

# 3.
movies_df.set_index("Title")["Guardians of the Galaxy":"Sing"]

# 4.
movies_df.loc[:, ["Title", "Rating"]]

# 5.
movies_df[::-1]

## Conditional Selection

Pandas makes it easy to select rows based on a condition. For example, if we want to select all the movies with a rating of 8.5 or higher, we can do the following:

In [None]:
movies_df[movies_df["Rating"] >= 8.5]

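You can combine multiple conditions using the operators `&` (and) and `|` (or). Wrap each condition in parentheses, because `&` and `|` take precedence over comparisons such as `>=`. For example, if we want to select all the movies with a rating of 8.5 or higher that came out after 2009, we can do the following: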

In [None]:
movies_df[(movies_df["Rating"] >= 8.5) & (movies_df["Year"] >= 2010)]

In Pandas, to negate a condition, you use the `~` operator. For example, if we want to select all the movies with a rating of 8.5 or lower, we can do the following:

In [None]:
movies_df[~(movies_df["Rating"] > 8.5)]
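
Conditional selection works because each comparison produces a boolean Series (a mask) with one value per row. The sketch below uses a made-up four-row `films` frame to show the masks and how they combine.

```python
import pandas as pd

# Made-up data, not taken from the movie dataset
films = pd.DataFrame({
    "Rating": [9.0, 8.5, 7.2, 8.8],
    "Year": [2008, 2014, 2016, 2012],
})

# A comparison produces a boolean Series, one True/False per row
high = films["Rating"] >= 8.5
recent = films["Year"] >= 2010
print(high.tolist())                 # [True, True, False, True]

# Masks combine with & (and), | (or), and ~ (not)
print(len(films[high & recent]))     # 2
print(len(films[high | recent]))     # 4
print(len(films[~high]))             # 1
```

Storing masks in named variables, as done here, can make long conditions easier to read than one nested expression.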

### Task: conditional selection

- Select all movies by director David Yates with a Metascore of 70 or higher.
- Check how many movies have both a rating of 8.5 or higher and a Metascore of 70 or higher.
- Check how many movies not directed by Ridley Scott were released after 2015 or before 2010.

In [None]:
# 1. 
movies_df[(movies_df.Director == "David Yates") & (movies_df.Metascore >= 70)]

# 2. 
len(movies_df[(movies_df.Rating >= 8.5) & (movies_df.Metascore >= 70)])

# 3.
len(movies_df[(movies_df.Director != "Ridley Scott") & ((movies_df.Year > 2015) | (movies_df.Year < 2010))])

### Querying

You can also use the `DataFrame.query` method to select rows based on conditions. For example, if we want to select all the movies with a rating of 8.5 or higher that were released after 2010, we can do the following:

In [None]:
movies_df.query("Rating >= 8.5 and Year > 2010")

Both approaches select the same rows. The difference is that `DataFrame.query` is often more convenient when you have many conditions, because the syntax is more compact.

You can find a nice list of query examples [here](https://sparkbyexamples.com/pandas/pandas-dataframe-query-examples/).
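
As a quick check that the two styles agree, here is a sketch on a hypothetical `films` frame. It also shows two `query` features that help with this dataset's column names: backticks around names containing spaces or parentheses, and `@` to refer to a Python variable.

```python
import pandas as pd

# Made-up data, not taken from the movie dataset
films = pd.DataFrame({
    "Director": ["X", "Y", "X", "Z"],
    "Rating": [9.0, 8.5, 7.2, 8.8],
    "Runtime (Minutes)": [152, 169, 98, 130],
})

# The same filter written both ways returns the same rows
by_mask = films[(films["Rating"] >= 8.5) & (films["Director"] != "X")]
by_query = films.query("Rating >= 8.5 and Director != 'X'")
print(by_mask.equals(by_query))   # True

# Backticks quote column names that are not valid Python identifiers
long_films = films.query("`Runtime (Minutes)` > 140")

# @ refers to a Python variable defined outside the query string
min_rating = 8.8
top_films = films.query("Rating >= @min_rating")
print(len(long_films), len(top_films))   # 2 2
```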

**Task:**  Use querying to check how many movies not directed by Ridley Scott were released after 2015 or before 2010.

In [None]:
len(movies_df.query("Director != 'Ridley Scott' and (Year > 2015 or Year < 2010)"))

## Basic Statistics

Pandas makes it easy to calculate basic statistics for your DataFrame. Let's start by calculating a descriptive overview of the numerical columns in our DataFrame.

We can do that by using the `DataFrame.describe` method. For numerical columns, it returns a DataFrame with the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each column:

In [None]:
movies_df.describe()

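We can also calculate the mean, median, and standard deviation of a column using the `mean`, `median`, and `std` methods, for example `movies_df["Rating"].mean()`. The same methods exist on a DataFrame with numerical columns, where they return one value per column.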
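
A minimal sketch of these methods on made-up numbers follows. The `scores` frame is hypothetical and only mimics the `Rating` and `Metascore` columns.

```python
import pandas as pd

# Made-up scores, not taken from the movie dataset
scores = pd.DataFrame({
    "Rating": [6.0, 7.0, 7.0, 8.0, 10.0],
    "Metascore": [60, 70, 70, 80, 100],
})

# On a single column (a Series) each method returns one number
print(scores["Rating"].mean())     # 7.6
print(scores["Rating"].median())   # 7.0
print(scores["Rating"].std())      # sample standard deviation (divides by n - 1)

# On a DataFrame of numerical columns they return one value per column
print(scores.mean())
```

The mean (7.6) sits above the median (7.0) here because the single 10.0 pulls the average up, which is the kind of difference worth looking for when comparing the two.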

### Task: basic statistics
- Compare the mean, median, and standard deviation of the `Metascore` and `Rating` columns.
- Calculate the mean revenue of movies directed by Christopher Nolan.
- Compare the average runtime of Comedy and Horror movies.

In [None]:
# 1.
# describe() reports the mean, the standard deviation (std), and the median (the 50% row)
movies_df[["Metascore", "Rating"]].describe()

# 2.
movies_df.query("Director == 'Christopher Nolan'")["Revenue (Millions)"].mean()

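# 3.
# Note: == keeps only rows whose Genre is exactly "Comedy" (or "Horror").
# If a Genre cell lists several genres, filter with movies_df["Genre"].str.contains("Comedy") instead.
print(movies_df.query("Genre == 'Comedy'")["Runtime (Minutes)"].mean())
print(movies_df.query("Genre == 'Horror'")["Runtime (Minutes)"].mean())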

## Aggregation

So far, we have been calculating statistics over all rows of a column at once. But what if we want to calculate a statistic separately for each group of rows, such as each director or each year? For that, we can use aggregation methods.

In this tutorial, we will learn about the `DataFrame.groupby` method. This method allows us to group rows based on a column and then calculate statistics for each group.

For example, if we want to calculate the average rating for each director, we can do the following:

In [None]:
movies_df.groupby("Director")["Rating"].mean()

We can analyse multiple columns at the same time by selecting a list of column names from the grouped data. For example, if we want to calculate the average rating and Metascore for each director, we can do the following:

In [None]:
movies_df.groupby("Director")[["Rating", "Metascore"]].mean()

There are many other aggregation methods that you can use with `DataFrame.groupby`. You can find a list of them [here](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html).

We will learn more about aggregation methods in other tutorials.
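
As a small preview, the sketch below runs `groupby` on a made-up five-row `films` frame, where the group means are easy to verify by hand. It also uses `agg`, a real pandas method not introduced in this tutorial, to compute several statistics per group in one call.

```python
import pandas as pd

# Made-up data, not taken from the movie dataset
films = pd.DataFrame({
    "Director": ["X", "X", "Y", "Y", "Y"],
    "Rating": [8.0, 9.0, 6.0, 7.0, 8.0],
})

# Split the rows by director, then compute the mean rating within each group
mean_by_director = films.groupby("Director")["Rating"].mean()
print(mean_by_director["X"])   # 8.5
print(mean_by_director["Y"])   # 7.0

# agg computes several statistics per group in one call
summary = films.groupby("Director")["Rating"].agg(["mean", "max", "count"])
print(summary)
```

The result of `groupby(...).mean()` is indexed by the grouping column, which is why `mean_by_director["X"]` looks up a director by name.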

## Sorting

To sort our DataFrame (or a subset of it) by a column, we can use the `DataFrame.sort_values` method. For instance, to sort movies by rating in descending order, we can do the following:

In [None]:
movies_df.sort_values("Rating", ascending=False)
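
Two details of `sort_values` are worth a quick sketch: it returns a new DataFrame rather than changing the original, and it accepts several sort keys. The `films` frame below is invented for this illustration.

```python
import pandas as pd

# Made-up data, not taken from the movie dataset
films = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2014, 2012, 2014, 2012],
    "Rating": [7.5, 8.1, 8.8, 6.9],
})

# sort_values returns a new DataFrame; films itself keeps its order
by_rating = films.sort_values("Rating", ascending=False)
print(by_rating["Title"].tolist())   # ['C', 'B', 'A', 'D']
print(films["Title"].tolist())       # ['A', 'B', 'C', 'D']

# Several keys: oldest year first, best rating first within each year
by_year_then_rating = films.sort_values(["Year", "Rating"], ascending=[True, False])
print(by_year_then_rating["Title"].tolist())   # ['B', 'D', 'C', 'A']
```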

# Putting it all together

Now we have learned all the concepts we need to answer our original questions about the data. Go ahead and implement them!

- What are the average ratings of movies by year?
- What genres are the most highly rated?
- What directors bring the most revenue to the studio?

**Bonus tasks**:
- Create a DataFrame with the 10 best rated and the 10 worst rated movies. Save it as a CSV file.
- Select all movies directed by the top 3 highest rated (by Metascore) directors.
- Find how many unique directors there are in the dataset.
- Find out the number of movies released by year in the dataset.

In [None]:
# 1.
movies_df.groupby("Year")["Rating"].mean().sort_values(ascending=False)

In [None]:
# 2.
movies_df.groupby("Genre")["Rating"].mean().sort_values(ascending=False)

In [None]:
# 3.
movies_df.groupby("Director")["Revenue (Millions)"].mean().sort_values(ascending=False)

Bonus tasks

There are different ways to solve these tasks. Some of these solutions use methods we will learn in the next tutorials.

In [None]:
# 1. 
# We first sort the dataframe by Rating in descending order
movies_df_by_rating = movies_df.sort_values("Rating", ascending=False)
# Then we create a new one by concatenating the first 10 rows with the last 10 rows. Check the documentation for pd.concat!
new_df = pd.concat([movies_df_by_rating.head(10), movies_df_by_rating.tail(10)])
new_df.to_csv("top_and_bottom_10.csv")

In [None]:
# 2.
# We first select the top 3 directors by mean metascore
top_3_directors = movies_df.groupby("Director")["Metascore"].mean().sort_values(ascending=False).head(3).index
# Then we query the dataframe to only keep the rows where the director is in the top 3
movies_df.query("Director in @top_3_directors")
# We can also use the isin() method. Check the documentation for more info!
movies_df[movies_df["Director"].isin(top_3_directors)]

In [None]:
# 3.
movies_df.Director.nunique()
# This is the same as doing: 
len(movies_df.Director.unique())
# But nunique() is much more readable, so it's better to use it

In [None]:
# 4.
# The count function counts the number of non-null values in a column
movies_df.groupby("Year").count()["Title"]