LSE Data Science Institute | DS105A (2023/24) | Week 08

# 🗓️ Week 08: Pre-processing and grouping data with pandas, a groupby-apply tutorial

Theme: Cleaning and reshaping data

**LAST UPDATED:** 15 November 2023

**AUTHOR:** Dr [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

-----


# **📚 PREPARATION**

1. Clone this repository to your computer.
2. Add it to your VS Code workspace.
3. Go to the [IMDb Non-Commercial Datasets](https://developer.imdb.com/non-commercial-datasets/) page, download all the `tsv.gz` files and place them in the `data/raw/` folder. This folder is gitignored because we don't want to push large data files to GitHub!
4. Run:

    ```bash
    pip install -r requirements.txt
    ```

## ⚙️ Setup

```python
import pandas as pd
import matplotlib.pyplot as plt

from pprint import pprint
from tqdm.notebook import tqdm

# Configure some settings for high quality plots
%config InlineBackend.figure_formats = ['svg']
%matplotlib inline
```

# Part 1: Read zipped files

You will have noticed that the files we downloaded from IMDb are compressed, which means they were transformed from **plain text** into a convenient **binary format** that uses less space. This is a good practice when handling large text-based files, making them easier to store and transfer. However, we _do_ need to read the files into memory, and for that, we need to decompress them.

**How to decompress files?**

- If you are on Windows, you can use software like [7-Zip](https://www.7-zip.org/). After installing it, right-click on the file and select "Extract here".
- If you are on Mac, you can use the built-in Archive Utility. Right-click on the file and select "Open with" > "Archive Utility". Or, simply double-click on the file.

**🐼 `pandas` to the rescue!**

Luckily, `pandas` has our back: `pd.read_csv()` can read compressed files directly, without the need to decompress them first. You can state the format explicitly with the `compression` argument; by default (`compression="infer"`), `pandas` detects it from the file extension, such as `.gz` <sup>\[[1](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\]</sup>.

If the file inside the compressed `.gz` file were a CSV, we could read it like this:

```python
pd.read_csv("../data/raw/name.basics.tsv.gz", compression="gzip").head()
```

However, note that the files we downloaded from IMDb are **TSV** files, not CSV. TSV stands for **Tab-Separated Values**, and it is a format similar to CSV, but instead of using commas to separate values, it uses tabs (`\t`). Why use TSV? Who knows. The developers thought it was a good idea, I guess. 

Anyway, since the format is still very similar to CSV, we can use the same function to read it, but we need to set the `sep` argument to tell `pandas` that the separator is a tab (`\t`). The code cell in section 1.1 does exactly that.
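To try the `compression` and `sep` arguments without waiting for a large IMDb file to load, here is a minimal sketch that writes a tiny gzip-compressed TSV to a temporary folder and reads it back. The file name `toy.tsv.gz` and its rows are made up for illustration; they are not IMDb data.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, made up for this sketch (not real IMDb rows)
toy_tsv = "id\tname\tyear\nx1\tAlice\t1990\nx2\tBob\t1985\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy.tsv.gz"

    # Write the text as a gzip-compressed file
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        f.write(toy_tsv)

    # Explicit compression and tab separator
    df_explicit = pd.read_csv(toy_path, compression="gzip", sep="\t")

    # Default compression="infer": gzip is detected from the .gz extension
    df_inferred = pd.read_csv(toy_path, sep="\t")

    # Without sep="\t", pandas splits on commas and finds a single column
    df_wrong_sep = pd.read_csv(toy_path, compression="gzip")

print(df_explicit.shape)   # three columns
print(df_wrong_sep.shape)  # one column holding the whole line
```

Comparing the two shapes is a quick way to spot a wrong separator: when every row lands in a single column, `sep` is usually the culprit.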

## 1.1: 👽 Establishing first contact with the data

```python
# This will take a bit of time, it is a large file
df_name_basics = pd.read_csv("../data/raw/name.basics.tsv.gz", compression="gzip", sep="\t")
```

🗣️ **QUESTION TO THE CLASSROOM:** What are the first things we should do when we read a new dataset, **whether** collected by you or from someone else?

<div style="color:#f8f8f8">

<details><summary>Click here to see some hidden tips</summary>

Open new code cells with the following code:

```python
# Glimpse at the data
df_name_basics.head()
```

```python
# Get a bit of info on the data types and memory usage
df_name_basics.info()
```

```bash
# How does the memory usage compare to the file size?
!ls -lth ../data/raw/
```
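
The same comparison can be done in Python alone, which helps where `ls` is not available (for example, on Windows without a Unix-like shell). This is a minimal sketch on a made-up toy file, not on the IMDb data; to apply it to the real thing you would point `os.path.getsize()` at the downloaded file and call `memory_usage()` on `df_name_basics`.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical toy data: one short row repeated many times
rows = "id\tname\n" + "x1\tAlice\n" * 10_000

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.tsv.gz")
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        f.write(rows)

    # Size of the compressed file on disk
    file_bytes = os.path.getsize(toy_path)
    df_toy = pd.read_csv(toy_path, sep="\t")

# deep=True also counts the Python strings stored in object columns
memory_bytes = df_toy.memory_usage(deep=True).sum()

print(f"On disk (compressed): {file_bytes:,} bytes")
print(f"In memory:            {memory_bytes:,} bytes")
```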
</details>

</div>

## 1.2: 🗃️ Being your best perfectionist with data types

🎯 **ACTION POINTS:**

Work in groups (same composition as your group project) and do the following:

1. Disable GitHub Copilot, and don't use ChatGPT for now. Otherwise, this will be less fun.

2. Modify the columns so that, in the end, they have the data types listed above. Try to deal with any errors that may arise.

3. Once you have solved the errors, go to issue [#1](https://github.com/lse-ds105/w08-imdb-data/issues/1), which I created in this repository, and add your group's solution (include your group's name).

```python
# Add your code here
```