**Previous book**: <a href='./01_initial_inspection.ipynb'>[Initial Inspection]</a>

## Part 2: Column refactoring
**NOTE**: This notebook expects <code>stage1_cleaned.csv</code> from the <code>01_initial_inspection.ipynb</code> notebook!


In the next stage of cleaning, we will restructure the columns of the dataset to improve its organisation and readability while keeping the underlying information intact. This process will divide existing columns into multiple new columns, and rename columns to better describe their contents.

**Input:** <code>/data/interim/stage1_cleaned.csv</code>
<br>
**Output:** <code>/data/interim/stage2_refactored.csv</code>
### Initial imports
To start, we need to import the modified dataset from the previous stage, <code>stage1_cleaned.csv</code>, found in the <code>interim</code> folder.

In [1]:
import numpy as np
import pandas as pd
import re

# Read the integer columns as nullable Int64; by default pandas reads
# integer columns that contain missing values as float
df = pd.read_csv("../data/interim/stage1_cleaned.csv", dtype={"VOTES": "Int64", "RunTime": "Int64"})
df.head()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"Action, Horror, Thriller",6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth| Stars:Peri Baume...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"Animation, Action, Adventure",5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"Drama, Horror, Thriller",8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805.0,44.0,
3,Rick and Morty,(2013– ),"Animation, Adventure, Comedy",9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849.0,23.0,
4,Army of Thieves,(2021),"Action, Crime, Horror",,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer| Stars:Matt...,,,


### Refactoring
This next step of cleaning is more transformative: it restructures the underlying data while trying to preserve as much information as possible.

<b>Regular expressions</b>, or regex, are a powerful tool here. A regex matches a particular sequence of characters within a larger string.
The following function matches a regex pattern against each value in a column, extracts the captured text and assigns it to a new column:

In [2]:
def regex_extract(df, new_column, old_column, regex_pattern):
    df[new_column] = df[old_column].str.extract(regex_pattern)
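
To see what the helper does, the sketch below runs it on a small made-up frame. The <code>toy</code> frame and its <code>code</code> column are illustrative and not part of the dataset, and <code>expand=False</code> is a variation on the cell above so that a single capture group comes back as a Series.

```python
import pandas as pd


def regex_extract(df, new_column, old_column, regex_pattern):
    # expand=False returns a Series when the pattern has one capture group
    df[new_column] = df[old_column].str.extract(regex_pattern, expand=False)


# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({"code": ["item-12", "item-7", "no digits", None]})

# Capture the digits that follow "item-"
regex_extract(toy, "number", "code", r"item-(\d+)")
print(toy)
```

The function modifies the dataframe in place and returns nothing. Rows where the pattern does not match, or where the source value is missing, receive a missing value in the new column.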

#### YEAR
As shown earlier, the <code>YEAR</code> column contains three pieces of information: when a production was released, when it ended (if applicable) and what type of production it was.

Use the <code>regex_extract</code> function to extract the start and end years:

In [3]:
regex_extract(df, "start_year", "YEAR", r"\((\d{4})")
regex_extract(df, "end_year", "YEAR", r"\(\d{4}–(.*)\)")
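
As a quick check of the two patterns, the sketch below applies them to three <code>YEAR</code> values taken from the head of the data frame shown earlier. It calls <code>str.extract</code> directly with <code>expand=False</code> rather than going through the helper, purely to keep the example self-contained.

```python
import pandas as pd

# Sample values copied from the YEAR column in the head shown earlier
years = pd.Series(["(2021)", "(2021– )", "(2010–2022)"])

# First four digits after the opening bracket
start = years.str.extract(r"\((\d{4})", expand=False)

# Everything between the en dash and the closing bracket
end = years.str.extract(r"\(\d{4}–(.*)\)", expand=False)

print(pd.DataFrame({"YEAR": years, "start_year": start, "end_year": end}))
```

A value without an en dash, such as <code>(2021)</code>, produces a missing <code>end_year</code>, while an open-ended range such as <code>(2021– )</code> produces a single space. The extracted years are strings, not integers.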

In [4]:
df["start_year"].value_counts().head(10)

start_year
2020    1676
2019    1334
2018    1074
2021     987
2017     835
2016     579
2015     426
2014     360
2013     263
2012     188
Name: count, dtype: int64

Where <code>end_year</code> is a single blank space (rather than a missing value), the production has a start year but no end year, so it is treated as 'ongoing' (i.e. not yet finished):

In [5]:
df.loc[df["end_year"] == " ", "end_year"] = "Ongoing"
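
The comparison above relies on the placeholder being exactly one space. A slightly more tolerant variant, sketched below on made-up values, treats any whitespace-only string as ongoing; whether the real data needs this is an assumption, not something the dataset has shown.

```python
import pandas as pd

# Hypothetical end_year values: one space, a year, a missing value, two spaces
end_year = pd.Series([" ", "2022", None, "  "])

# True only for strings made up entirely of whitespace; missing values give False
is_blank = end_year.str.fullmatch(r"\s*", na=False)

# Replace the blank entries and leave everything else untouched
end_year = end_year.mask(is_blank, "Ongoing")
print(end_year)
```

Missing values stay missing, so productions without any end-year information remain distinguishable from ongoing ones.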

In [6]:
df["end_year"].value_counts().head(10)

end_year
Ongoing    2911
2020        446
2021        208
2019        143
2012         94
2018         84
2008         78
2013         72
2014         57
2017         57
Name: count, dtype: int64

Most of the non-missing <code>end_year</code> values are <code>Ongoing</code>. However, as this dataset is from 2021, this information is out of date, and an external source will need to be consulted in a future stage of this project.

The last piece of information in the <code>YEAR</code> column is the production type. It can be extracted by applying a helper function to each value with the pandas <code>apply</code> method.

In [7]:
def production_type(entry):
    try:
        entry_string = re.findall(r"([^\W\d]+(?:\s[^\W\d]+)?)", entry)
        return entry_string[0]
    # IndexError - no matches found (the list is empty)
    # TypeError - the entry is not a string (e.g. a missing value)
    except (IndexError, TypeError):
        return np.nan

df["production_type"] = df["YEAR"].apply(production_type)
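
Because the function keeps only the first match, the same result can probably be obtained with <code>str.extract</code>, which also returns the first match. The sketch below compares the two approaches on made-up <code>YEAR</code> values; the sample strings, including the <code>TV Movie</code> label, are assumptions about what the column may contain.

```python
import re

import numpy as np
import pandas as pd

# Same pattern as in the cell above: one or two words made of letters
PATTERN = r"([^\W\d]+(?:\s[^\W\d]+)?)"


def production_type(entry):
    try:
        return re.findall(PATTERN, entry)[0]
    except (IndexError, TypeError):
        return np.nan


# Hypothetical YEAR values, including a missing one
years = pd.Series(["(2019 TV Movie)", "(2021)", "(2013– )", np.nan])

via_apply = years.apply(production_type)
via_extract = years.str.extract(PATTERN, expand=False)

print(pd.DataFrame({"YEAR": years, "apply": via_apply, "extract": via_extract}))
```

Both columns should agree: a label where the value contains letters, and a missing value otherwise.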

All the information in the <code>YEAR</code> column is now preserved in new columns, so the original column is redundant. Drop it:

In [8]:
df = df.drop("YEAR", axis=1)

#### STARS
As shown previously, the <code>STARS</code> column contains semi-structured data unsuitable for a table in a relational database. Extract the actors into an <code>actors</code> column, kept as a comma-separated string:

In [9]:
regex_extract(df, "actors", "STARS", r"(?:Star[s]?:)(.+)")

Do the same for directors:

In [10]:
regex_extract(df, "director", "STARS", r"(?:Director[s]?:)([^|]+)")
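
The sketch below applies both patterns to two made-up <code>STARS</code> values shaped like the rows in the head shown earlier (the full strings are assumptions, since the displayed values are truncated). It also shows one possible way to turn the extracted actors string into a Python list with <code>str.split</code>, should a list be needed later.

```python
import pandas as pd

# Hypothetical STARS values modelled on the truncated rows shown earlier
stars = pd.Series([
    "Director:Peter Thorwarth| Stars:Peri Baumeister, Carl Anton Koch",
    "Stars:Andrew Lincoln, Norman Reedus",
])

# Everything after "Star:" or "Stars:" up to the end of the string
actors = stars.str.extract(r"(?:Star[s]?:)(.+)", expand=False)

# Everything after "Director:" or "Directors:" up to the next pipe
director = stars.str.extract(r"(?:Director[s]?:)([^|]+)", expand=False)

# Optional: split the comma-separated string into a list of names
actor_lists = actors.str.split(", ")

print(pd.DataFrame({"actors": actors, "director": director}))
print(actor_lists.tolist())
```

Rows without a director, such as the second one, receive a missing value in the director column.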

With all the information extracted from the <code>STARS</code> column, it can be safely dropped:

In [11]:
df = df.drop("STARS", axis=1)

### Renaming and reordering
Rename the columns to be both descriptive and consistently formatted:

In [12]:
new_names = {
    "MOVIES": "title",
    "GENRE": "genre",
    "RATING": "rating",
    "ONE-LINE": "summary",
    "VOTES": "votes",
    "RunTime": "run_time",
    "Gross": "gross"
}

df = df.rename(new_names, axis=1)

Finally, reorder the columns into a more logical order:

In [13]:
ordered_columns = [
    "title",
    "start_year",
    "end_year",
    "genre",
    "summary",
    "rating",
    "votes",
    "run_time",
    "gross",
    "production_type",
    "actors",
    "director"
]

df = df.loc[:, ordered_columns]
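
Selecting columns from a list silently drops any column that is not in the list. A small guard like the one sketched below, shown on a made-up three-column frame, can catch that before reordering; the frame and the shortened column list are illustrative only.

```python
import pandas as pd

# Hypothetical frame using a few of the original column names
toy = pd.DataFrame({"MOVIES": ["A"], "GENRE": ["Drama"], "RATING": [7.0]})
toy = toy.rename(columns={"MOVIES": "title", "GENRE": "genre", "RATING": "rating"})

ordered = ["title", "rating", "genre"]

# Columns that would be lost by the reordering step
missing = set(toy.columns) - set(ordered)
assert not missing, f"Columns not listed in the new order: {missing}"

toy = toy.loc[:, ordered]
print(list(toy.columns))
```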

### Refactoring summary
Re-examine the head of the data frame to check the changes:

In [14]:
df.head()

Unnamed: 0,title,start_year,end_year,genre,summary,rating,votes,run_time,gross,production_type,actors,director
0,Blood Red Sky,2021,,"Action, Horror, Thriller",A woman with a mysterious illness is forced in...,6.1,21062.0,121.0,,,"Peri Baumeister, Carl Anton Koch, Alexander Sc...",Peter Thorwarth
1,Masters of the Universe: Revelation,2021,Ongoing,"Animation, Action, Adventure",The war for Eternia begins again in what may b...,5.0,17870.0,25.0,,,"Chris Wood, Sarah Michelle Gellar, Lena Headey...",
2,The Walking Dead,2010,2022,"Drama, Horror, Thriller",Sheriff Deputy Rick Grimes wakes up from a com...,8.2,885805.0,44.0,,,"Andrew Lincoln, Norman Reedus, Melissa McBride...",
3,Rick and Morty,2013,Ongoing,"Animation, Adventure, Comedy",An animated series that follows the exploits o...,9.2,414849.0,23.0,,,"Justin Roiland, Chris Parnell, Spencer Grammer...",
4,Army of Thieves,2021,,"Action, Crime, Horror","A prequel, set before the events of Army of th...",,,,,,"Matthias Schweighöfer, Nathalie Emmanuel, Ruby...",Matthias Schweighöfer


In this part, the following changes were made to the data:
<ul>
    <li>Start and end years extracted from the <code>YEAR</code> column.</li>
    <li>Created a column for the type of production also extracted from the <code>YEAR</code> column.</li>
    <li>Actors and directors extracted from the <code>STARS</code> column.</li>
    <li>Columns were renamed and reordered into a more logical order.</li>
</ul>

At this point, the data is clean: the columns are of the correct data type, cell contents are correctly formatted, and most of the original data is retained. It may now be used with minimal changes in a new project.

Export the current dataframe as a CSV file named <code>stage2_refactored.csv</code>:

In [15]:
df.to_csv("../data/interim/stage2_refactored.csv", index=False)

This file will be used as the starting point for the next part, <i>data subsetting</i>, where we focus on a specific type of production.
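
CSV files do not store column types, so the nullable integer columns will likely need the same <code>dtype</code> hint when the file is read back in. The sketch below illustrates this with an in-memory buffer instead of the real file; the two rows reuse values from the head shown earlier, and the three-column frame is a simplification.

```python
import io

import pandas as pd

# Small stand-in for the refactored data, with one missing vote count
toy = pd.DataFrame({
    "title": ["Blood Red Sky", "Army of Thieves"],
    "votes": pd.array([21062, None], dtype="Int64"),
    "run_time": pd.array([121, None], dtype="Int64"),
})

# Write to an in-memory buffer instead of ../data/interim/
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without the dtype hint these columns would be read back as float
reloaded = pd.read_csv(buffer, dtype={"votes": "Int64", "run_time": "Int64"})
print(reloaded.dtypes)
```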

### Navigation
**Previous book**: <a href='./01_initial_inspection.ipynb'>[Initial Inspection]</a>

**Next book**: <a href='./03_data_subsetting.ipynb'>[Data Subsetting]</a>