# Introduction to Pandas -  Part 1: Data Cleaning

Long gone are the days of pen-and-paper research. Whether it is for data analysis, data cleaning or data manipulation, proficiency in digital tools has become an unavoidable skill to nurture and exploit. The social sciences are no exception. Yet the reality for a vast number of researchers in psychology today still involves long hours of click-intensive, repetitive tasks in front of spreadsheets (*e.g.* Microsoft Excel) or similar statistical analysis software packages (*e.g.* IBM's SPSS). Further, any new iteration of the analysis consists of repeating the same steps and manipulations manually, as if starting anew. This not only harms reproducibility (a foundational goal of scientific research and a documented pitfall in the humanities), but also limits the capacity of research teams to develop expertise in reliable tools and processes that can be reused in different projects and passed on to new students or collaborators. Not to mention the cumulative hours graduate students lose performing menial tasks, hours that could be better spent on more creative, cognitive work.

For these reasons, graduate students are increasingly encouraged to take workshops or mini-courses in basic programming, and it is not rare to find at least one proficient developer in research teams whose subject of study might very well be far from computer science. In such cases, the most popular choice of programming language is Python. With a very rich ecosystem of data science tools, Python is a general-purpose language that lets you create automated pipelines ranging from faster, more reliable Excel-like operations to automated data gathering through web scraping or web APIs, statistical analysis, beautiful publication-ready plots, advanced predictive models and much more.


This post was originally intended to be a simple analysis of the Bechdel test data (intro soon). However, by popular demand from friends and colleagues who enjoyed my previous tutorials on matplotlib (LINKS!!!), I decided to release this parallel series that would act as an introduction to pandas for all of you out there who are just discovering this pillar of Python data science, or those who might not yet pandas good and want to learn to do other stuff good too. If you are already comfortable with pandas and/or are interested only in the analysis, you can look at the original analysis at this link. (LINK!!)

So, for the most inexperienced among us, what is pandas? At its simplest, it is the sovereign data manipulation tool in your Python data science toolbelt when you are dealing with data that fits nicely into a table format. If you have data in an Excel file, csv, SQL database, etc. and you are planning on cleaning it, exploring it, analyzing it or really doing anything at all with it, you should immediately think of pandas (the bears first, then the library). Pandas makes it very easy to act on tabular data through its DataFrame (2D table, as in rows and columns) and Series (1D, as in just a column) objects, which come packed with fast and scalable functions for most tasks you might need to do. In this first post, I'll introduce some of these functions, with the goal of preparing a dataset for exploratory analysis. In a later post, we will take this clean data and work on it, also with pandas, to derive and share new insights from it.

I could praise pandas for days, but I had better let it speak for itself. Before that, a quick intro to the data used in this tutorial. As a favor to a friend, we will all be looking at the [Bechdel test movie list](https://bechdeltest.com/). The Bechdel test is a scoring system for movies, popularized by a 1985 comic, that gives a score of 1 if the movie has at least two named women in it (0 if it doesn't), a 2 if they talk to each other and a 3 if they talk together about something other than a man. It is a great dataset for practice: it is easily accessible, the fact that anyone can contribute to it gives rise to good opportunities to practice cleaning, and since it comes with an imdb-id column for the movies, we will be able to mix it with other movie datasets (with info on movie ratings, budgets, etc.) for new juicy insights.

So, a little less conversation and a little more action: we will start by importing the necessary libraries. For today, just pandas and pathlib's Path to help deal with file paths:

## Libraries

In [1]:
# Pathing
from pathlib import Path

# Data structures
import pandas as pd

## General Parameters 
It is good practice to define any hard-coded values together at the beginning of your code, so we first define some general parameters that we will use throughout (proper style guidelines might encourage you to capitalize them too, but I'm still not a fan of shouting in my code). That way, if for some reason we change a path or a file name, we can do it once here and know the rest of our code will still run smoothly.

In [2]:
url = 'http://bechdeltest.com/api/v1/getAllMovies' # The original source of the data

data_dir = Path('../data') # define a data folder one folder up

bechdel_path = data_dir/'bechdel.csv' # default save example of the data
bechdel_raw_path = data_dir/'bechdel_raw.tsv.gz' # proper save of the raw data
bechdel_frozen_path = data_dir/'bechdel_frozen.tsv.gz' # frozen data file
bechdel_clean_path = data_dir/'bechdel_clean.tsv.gz' # clean frozen data. What will be used for analysis


And in case you don't have the data folder already, let's create it before going forward:

In [22]:
data_dir.mkdir(parents=True, exist_ok=True)

## The Data - Getting, saving and reading data
First things first, we need to go get the data. Often, this might be in the form of a .txt, .csv or .tsv saved on your local machine. 

For the Bechdel data, we are lucky to have access to an easily accessible table at [this website](https://bechdeltest.com/) with a super simple API, which we can (non-coincidentally) use through the url variable defined above. In most cases, to fetch data online, depending on how the other side handles their data hosting, you would need to use the great *requests* library and/or, if you need to scrape a web page, something like *beautifulsoup*. Here though, getting all the movies and their ratings is as easy as using the following line:

In [23]:
data = pd.read_json(url)

For those not used to the *json* format, it is very similar to a Python dictionary and it's a bit of a standard when doing web requests like we did. In this case, we are essentially telling pandas to go get the json (*i.e.* dictionary) of all movies from a remote server (the website's) and convert it into a regular DataFrame object. The most basic thing to do with it at this point is just to print it to see what it looks like:

In [24]:
data

Unnamed: 0,id,title,imdbid,year,rating
0,8040,Roundhay Garden Scene,0392728,1888,0
1,5433,Pauvre Pierrot,0000003,1892,0
2,6200,"Execution of Mary, Queen of Scots, The",0132134,1895,0
3,5444,Tables Turned on the Gardener,0000014,1895,0
4,5406,Une nuit terrible,0000131,1896,0
...,...,...,...,...,...
8887,9501,Ginny and Georgia,10813940,2021,2
8888,9504,Raya and the Last Dragon,5109280,2021,3
8889,9519,Coming 2 America,6802400,2021,3
8890,9526,Moxie,6432466,2021,3


From this quick glance, we see we have 5 columns: their internal id, the title of the movie, the [imdb](https://www.imdb.com/) id, the year the movie was released and the Bechdel rating. For now, we will leave the cleaning and exploration at that (not for long).
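
To make the JSON-to-DataFrame step less magical, here is a minimal sketch that does not touch the network. The payload is hand-written for illustration (a list of records with a few of the fields shown above), and I am assuming the real API returns something of the same general shape:

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for the API response: one dictionary per movie
payload = '''[
    {"id": 8040, "title": "Roundhay Garden Scene", "year": 1888, "rating": 0},
    {"id": 5433, "title": "Pauvre Pierrot", "year": 1892, "rating": 0}
]'''

# Each record becomes a row and each key becomes a column
toy = pd.read_json(StringIO(payload))
print(toy)
```

Wrapping the text in StringIO makes pandas treat it like a file, which is the form *read_json* expects when the JSON is not coming from a path or a url.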

### Saving

First, we will save the data locally into a .csv file so we don't have to re-download the whole dataset from the website if we were to mess it up. It also gives us a chance to look at writing and reading dataframes with pandas in the more traditional, tutorial-like way. So, first, saving it locally:

In [25]:
data.to_csv(bechdel_path)

That's the most basic way to save the file. You should see in your directory that a new bechdel.csv file appeared, and if you were to open it, you would see a neat table with all of the data in it separated by commas (as the name .csv implies...). Among the many options listed in the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) for saving to csv, the ones most worth looking into are:
* *sep*: if you plan on using a column delimiter other than the comma ',' (useful if you will be storing text with commas in it, for example).
    * Notably, to save with a tab delimiter (a .tsv, essentially), simply use sep='\t' (of course, you would need to change the extension in the filename yourself, since extensions are only for humans anyway).
* *header* and *index*: to determine whether you want to save the index column and/or the header row.
* *mode*: if you want to write a new file (and overwrite if the file exists already) use the default 'w' or if you want to append to an existing file use 'a'.
* *compression*: if you want your file to be compressed when saved (great for large files). For example, using 'gzip' (again, good practice to change the extension of the file if you compress.)
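
As a small sketch of how these options combine, the snippet below writes a two-row toy table to a temporary folder and reads it back. The file name and the toy rows are made up for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({'title': ['Dry, The', 'Brave'], 'rating': [3, 3]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'toy.tsv.gz'  # hypothetical file name

    # Tab-separated, without the index column, gzip-compressed
    toy.to_csv(out_path, sep='\t', index=False, compression='gzip')

    # Read it back with the same separator; the compression is inferred from '.gz'
    round_trip = pd.read_csv(out_path, sep='\t')

print(round_trip)
```

Because the separator is a tab, the comma inside 'Dry, The' needs no special treatment.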

To keep it clean and tiny, I will save the dataframe we got from the web again, but this time with tabs instead of commas, in case commas appear in movie titles and compressed (also manually deleting the first .csv we created with the previous line of code). We will also tell it that it's not necessary to save the index, since it's just the row number:

In [6]:
data.to_csv(bechdel_raw_path, sep='\t', index=False, compression='gzip')

### Reading
We downloaded and saved the data table locally. However, between the time I wrote this and whenever you are reading it, there might be some new entries, or changes to the data already at the website. So, to make sure that you can follow the tutorial with me, I saved the data from when I wrote this in a separate file:

In [3]:
data = pd.read_csv(bechdel_frozen_path, sep='\t', dtype={'imdbid':str})
data

Unnamed: 0,year,imdbid,rating,title,id
0,1888,0392728,0,Roundhay Garden Scene,8040
1,1892,0000003,0,Pauvre Pierrot,5433
2,1895,0132134,0,"Execution of Mary, Queen of Scots, The",6200
3,1895,0000014,0,Tables Turned on the Gardener,5444
4,1896,0000131,0,Une nuit terrible,5406
...,...,...,...,...,...
8834,2021,5144174,3,"Dry, The",9498
8835,2021,10919362,3,Sweetheart,9505
8836,2021,10813940,2,Ginny and Georgia,9501
8837,2021,5109280,3,Raya and the Last Dragon,9504


Nothing very fancy there. As before, we changed the delimiter to tabs ('\t'), since that's how we saved it. Notice that we didn't need to explicitly tell it that the file was compressed... pretty neat. We also had to specify the datatype of at least one column, the imdbid column. That is because pandas by default will try to infer the data type of each column. Since imdbid can be transformed into numbers, we would get a column of floats (floats rather than integers because of the missing values). That is problematic because any entry that starts with 0s would lose them. For example, the id for Pauvre Pierrot (entry 1), which has an imdbid of 0000003, would be converted to 3.0. That doesn't accurately represent the true imdbid and would make cross-referencing harder (among other issues).
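
Here is a minimal sketch of that leading-zero problem, using a tiny in-memory file instead of the real one. The two ids are taken from the table above; the row with the empty id is added only to mimic a missing value:

```python
from io import StringIO

import pandas as pd

# Tab-separated toy file: two ids with leading zeros and one missing id
raw = "imdbid\ttitle\n0392728\tRoundhay Garden Scene\n0000003\tPauvre Pierrot\n\tWonder Woman\n"

inferred = pd.read_csv(StringIO(raw), sep='\t')
as_text = pd.read_csv(StringIO(raw), sep='\t', dtype={'imdbid': str})

print(inferred['imdbid'].tolist())  # numbers: the leading zeros are gone
print(as_text['imdbid'].tolist())   # strings: the zeros survive, the gap is still NaN
```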

Generally, if you know the data types in advance and it's not incredibly troublesome to do so, you should specify the data types of the columns in the *read_csv* function itself with the keyword *dtype*. Other important keywords for *read_csv* worth remembering, all of which can be found in the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), are:

* *header*, *index_col*: To specify the row to use as headers and/or the column to use as index. You'll mostly use them when you don't actually have a header and pandas still uses your first row of data as one, and/or when you have an index but pandas thinks it's another data column and adds a second index to the dataframe.

* *usecols*: if you only need some columns, but not all, you can specify which ones here.

* *names*: you can specify the name of the columns if you didn't specify the header.

* *delim_whitespace*: if the separator is empty space between columns (tab '\t' included) you can use this to read them. Recent pandas releases deprecate it in favour of the equivalent sep='\s+'.

There are quite a lot more options, so if at any point you need to load a dataframe in some special way not covered by these keywords, you should go look there first.
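
To see several of these keywords working together, here is a sketch that reads a headerless toy file from memory. The column names are ones I chose for the example (the file itself has none):

```python
from io import StringIO

import pandas as pd

# A headerless file: row number, year, imdb id, rating, title
raw = "0,1888,0392728,0,Roundhay Garden Scene\n1,1892,0000003,0,Pauvre Pierrot\n"

toy = pd.read_csv(
    StringIO(raw),
    header=None,                                  # the first line is data, not names
    names=['row', 'year', 'imdbid', 'rating', 'title'],  # so we supply the names
    index_col='row',                              # use the first column as the index
    usecols=['row', 'year', 'imdbid', 'title'],   # skip the rating column
    dtype={'imdbid': str},                        # keep the leading zeros
)
print(toy)
```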

## The DataFrame - looking at the object, before looking at what's in it.
#### .head, .tail, .sample, .info, .shape, .columns, .dtypes

Before looking at the details of the data, it pays off to have an idea of what the dataframe itself looks like: how big is it? What type of data is encoded? This will let us know how much we will be able to do with the content itself.

First, and most common, is to look at some entries to get a glimpse of what we can find. We can look at the top, the bottom or some randomly chosen sample of rows:

In [4]:
data.head(3)

Unnamed: 0,year,imdbid,rating,title,id
0,1888,392728,0,Roundhay Garden Scene,8040
1,1892,3,0,Pauvre Pierrot,5433
2,1895,132134,0,"Execution of Mary, Queen of Scots, The",6200


In [5]:
data.tail(2)

Unnamed: 0,year,imdbid,rating,title,id
8837,2021,5109280,3,Raya and the Last Dragon,9504
8838,2021,9286908,2,High Ground,9500


In [6]:
data.sample(6)

Unnamed: 0,year,imdbid,rating,title,id
4622,2006,475944,3,"Covenant, The",6352
7705,2016,5598100,1,Patients,7911
6409,2012,1217209,3,Brave,3379
1313,1972,68646,2,"Godfather, The",2224
4743,2007,1007920,3,Barbie Fairytopia: Magic of the Rainbow,7714
5613,2010,1521848,3,Potiche,2264


For the head and tail functions, one can simply specify how many rows to see and they get delivered. By default, head and tail return 5 rows. Sample by default will return only one row. To tailor the sampling itself, the function gives you access to parameters such as sampling weights, a random seed and even the possibility to sample columns if needed. As always, I recommend checking the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html).
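
A quick sketch of those sampling parameters on a made-up five-row table (the seeds and weights are arbitrary choices for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1888, 1892, 1895, 1896, 2021],
    'rating': [0, 0, 0, 0, 3],
    'title': ['Roundhay Garden Scene', 'Pauvre Pierrot',
              'Tables Turned on the Gardener', 'Une nuit terrible', 'Moxie'],
})

# Same seed, same rows: handy when a tutorial has to be reproducible
first_draw = toy.sample(n=3, random_state=42)
second_draw = toy.sample(n=3, random_state=42)
print(first_draw.equals(second_draw))

# Weighted sampling: only the row with a non-zero weight can be drawn
only_recent = toy.sample(n=1, weights=[0, 0, 0, 0, 1], random_state=0)
print(only_recent)

# Sampling columns instead of rows
two_columns = toy.sample(n=2, axis=1, random_state=0)
print(two_columns.columns.tolist())
```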

Now, for a general look at the dataframe, we can use the following function:

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    8839 non-null   int64 
 1   imdbid  8835 non-null   object
 2   rating  8839 non-null   int64 
 3   title   8839 non-null   object
 4   id      8839 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 345.4+ KB


We get all of our columns, how many rows there are, what type of data each column contains and how many non-null values it has (and so, by subtraction, how many NaNs). We could have gotten similar info with:

In [8]:
data.shape

(8839, 5)

In [9]:
data.columns

Index(['year', 'imdbid', 'rating', 'title', 'id'], dtype='object')

In [10]:
data.dtypes

year       int64
imdbid    object
rating     int64
title     object
id         int64
dtype: object

In [11]:
data.notna().sum()

year      8839
imdbid    8835
rating    8839
title     8839
id        8839
dtype: int64

This should all be relatively self-explanatory, except maybe for the last one. I'll talk specifically about the *.notna()* function in a bit, when we look at the data. One important thing to notice, and one of my favourite features of pandas (as you will see in the rest of the tutorial), is how we can chain a series of functions to obtain the results we want. In this case, on data, we used the *.notna()* function followed by the *.sum()* function (since every True from *.notna()* counts as 1 and every False as 0) to obtain the total number of non-NaN elements per column.
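
If the chaining feels opaque, it can help to split it into its two steps once. A minimal sketch on a made-up three-row table:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'imdbid': ['0392728', np.nan, '0000014'],
    'rating': [0, 3, 0],
})

mask = toy.notna()        # DataFrame of booleans, same shape as toy
per_column = mask.sum()   # True counts as 1, False as 0, summed down each column
print(per_column)

# The chained version does the same thing in one line
print(toy.notna().sum().equals(per_column))
```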

## The Data - Clean it before using it to save future headaches
#### .isna, .notna

While it might be very tempting to jump right into plotting and moving the data around, it is important first to at least perform some basic cleaning. At the very least, that should involve taking care of NaN values and, as we saw before, making sure the data types correspond to what they should be.

We will first take care of the NaNs since, with most functions, having NaNs present leads to errors or unwanted behaviour. Let's look at where the NaNs are:

In [22]:
data.isna().sum()

year      0
imdbid    4
rating    0
title     0
id        0
dtype: int64

### Short aside - Indexing
#### .loc, .iloc
Only the imdbid column contains any nans, let's look at them, by using *.isna()* only on the imdbid column and using the returned boolean Series as row positions to keep:

In [23]:
data['imdbid'].isna()

0       False
1       False
2       False
3       False
4       False
        ...  
8834    False
8835    False
8836    False
8837    False
8838    False
Name: imdbid, Length: 8839, dtype: bool

We can use the boolean series to index directly:

In [24]:
data.loc[data.imdbid.isna(), :]

Unnamed: 0,year,imdbid,rating,title,id
7602,2015,,3,"Danish Girl , The",9081
8208,2017,,3,Wonder Woman,9294
8280,2017,,3,Wonder Woman,9293
8545,2019,,3,"Rise Of Skywalker, The",9098


But we will save the NaN indices, since we will need them later too:

In [25]:
nan_idx = data.loc[data.imdbid.isna(), :].index
nan_idx

Int64Index([7602, 8208, 8280, 8545], dtype='int64')

Above, we used .loc[ ] to take a specific set of rows and columns of the dataframe: only those rows that have True for *.isna()*, and all columns (:). If we wanted multiple columns, say, year and title, we would just pass them as a list-like object:

In [26]:
data.loc[nan_idx, ['year', 'title']]

Unnamed: 0,year,title
7602,2015,"Danish Girl , The"
8208,2017,Wonder Woman
8280,2017,Wonder Woman
8545,2019,"Rise Of Skywalker, The"


If you wanted only one column, but still want to keep the output a dataframe, pass the single column as a list. If you don't mind having a pandas Series as an output (the 1D data structure equivalent to a dataframe, with its own methods), pass the name by itself:

In [27]:
data.loc[nan_idx, ['title']]

Unnamed: 0,title
7602,"Danish Girl , The"
8208,Wonder Woman
8280,Wonder Woman
8545,"Rise Of Skywalker, The"


In [28]:
data.loc[nan_idx, 'title']

7602         Danish Girl , The
8208              Wonder Woman
8280              Wonder Woman
8545    Rise Of Skywalker, The
Name: title, dtype: object

*.loc[ ]* is great to get data from the dataframe if you have a list of indices or columns you want to access, or some type of boolean array like the one we get from *.isna()*. If instead you want to use the numeric position (*e.g.* the 50th to 60th row, 2nd to 4th column), you can use *.iloc[ ]* instead:

In [29]:
data.iloc[50:63:3, 2:5]

Unnamed: 0,rating,title,id
50,0,Good Glue Sticks,5615
53,1,Le barometre de la fidelite,6211
56,0,A Trip to Mars,5685
59,2,Cleopatra,2003
62,0,"Voyage of the Bourrichon Family, The",5364
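

In our dataframe the index labels happen to equal the row positions, which hides the difference between the two indexers. The sketch below uses a made-up table with non-consecutive labels to make it visible:

```python
import pandas as pd

toy = pd.DataFrame(
    {'title': ['Dracula', 'Brave', 'Moxie'], 'rating': [2, 3, 3]},
    index=[182, 6409, 8890],  # labels that do not match the row positions
)

by_label = toy.loc[6409, 'title']  # the row labelled 6409
by_position = toy.iloc[1, 0]       # second row, first column
print(by_label, by_position)

# Slices differ too: .loc includes the end label, .iloc excludes the end position
print(len(toy.loc[182:6409]), len(toy.iloc[0:1]))
```

This becomes relevant as soon as rows are dropped or filtered, since the labels then stop matching the positions.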


### Dealing with NaNs
#### .dropna, .fillna, .rolling
So, anyway, back to our Nans:

In [30]:
data.loc[nan_idx, :]

Unnamed: 0,year,imdbid,rating,title,id
7602,2015,,3,"Danish Girl , The",9081
8208,2017,,3,Wonder Woman,9294
8280,2017,,3,Wonder Woman,9293
8545,2019,,3,"Rise Of Skywalker, The",9098


Depending on the nature of your NaNs, how many you have, the data they should represent, etc., there are a million things one can do with them. The first and simplest, if you are willing and able to do without some data, is dropping the whole row or column containing NaNs with *.dropna*. Look at the number of rows of the output compared to what we had before:

In [31]:
data.dropna()

Unnamed: 0,year,imdbid,rating,title,id
0,1888,0392728,0,Roundhay Garden Scene,8040
1,1892,0000003,0,Pauvre Pierrot,5433
2,1895,0132134,0,"Execution of Mary, Queen of Scots, The",6200
3,1895,0000014,0,Tables Turned on the Gardener,5444
4,1896,0000131,0,Une nuit terrible,5406
...,...,...,...,...,...
8834,2021,5144174,3,"Dry, The",9498
8835,2021,10919362,3,Sweetheart,9505
8836,2021,10813940,2,Ginny and Georgia,9501
8837,2021,5109280,3,Raya and the Last Dragon,9504


The way we ran *.dropna*, the changes were not actually applied to the dataframe. If we wanted to keep the new NaN-less data, we could save it as a new variable or overwrite our current dataframe:

clean_data = data.dropna()

data = data.dropna()

Another option (which I don't like personally, so you won't see much of it here) is to use the **inplace** keyword. *inplace* appears in several pandas functions and essentially overwrites the dataframe without actually having to use data = data.dropna():

data.dropna(inplace=True)

For pros and (mainly) cons of using *inplace*, I encourage looking at this [stackoverflow discussion](https://stackoverflow.com/questions/45570984/in-pandas-is-inplace-true-considered-harmful-or-not)
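
As a small illustration of the row-versus-column choice, here is a sketch on a made-up three-row table. It also uses *subset*, a *.dropna* parameter not discussed above, which restricts the check to certain columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'year': [1892, 2015, 2019],
    'imdbid': ['0000003', np.nan, np.nan],
    'title': ['Pauvre Pierrot', 'Danish Girl , The', np.nan],
})

rows_dropped = toy.dropna()                 # drop every row with at least one NaN
cols_dropped = toy.dropna(axis=1)           # drop every column with at least one NaN
subset_only = toy.dropna(subset=['title'])  # only look at the title column

print(len(rows_dropped), list(cols_dropped.columns), len(subset_only))
print(len(toy))  # the original is untouched unless we reassign
```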


Since we don't want to drop any rows or columns, we won't be using *.dropna()*, but will instead fill the missing values. A very helpful function if you decide/need to go this way is the aptly named *.fillna()*. *.fillna* offers some great functionality to solve most cases: you can either pass a single value which will replace all NaNs, or you could use one of the predefined methods to repeat the last known measure ('ffill') or repeat the next known measure ('bfill'):

In [32]:
data.loc[nan_idx, :].fillna(-99)

Unnamed: 0,year,imdbid,rating,title,id
7602,2015,-99,3,"Danish Girl , The",9081
8208,2017,-99,3,Wonder Woman,9294
8280,2017,-99,3,Wonder Woman,9293
8545,2019,-99,3,"Rise Of Skywalker, The",9098


Or fill with the last known value (forward fill):

In [33]:
# .ffill() is the method form of fillna(method='ffill'), which newer pandas versions deprecate
data.ffill().loc[nan_idx.union(nan_idx - 1), :]

Unnamed: 0,year,imdbid,rating,title,id
7601,2015,4428814,1,La loi du march&eacute;,6279
7602,2015,4428814,3,"Danish Girl , The",9081
8207,2017,5155780,3,"Discovery, The",7581
8208,2017,5155780,3,Wonder Woman,9294
8279,2017,7341676,3,OM+ME,7895
8280,2017,7341676,3,Wonder Woman,9293
8544,2019,5363618,1,Sound of Metal,9448
8545,2019,5363618,3,"Rise Of Skywalker, The",9098


In the first one we see that the same value of -99 was applied to all NaNs (and that is all NaNs in the dataframe, not per column), while the second example has different values, each corresponding to the imdb id of the last non-NaN row above it.
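
The three behaviours are easier to compare side by side on a tiny made-up Series. Here I use the *.ffill()* and *.bfill()* methods, which do the same as the 'ffill' and 'bfill' options:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.0, np.nan, np.nan, 3.0])

print(ratings.fillna(-99).tolist())  # [1.0, -99.0, -99.0, 3.0]
print(ratings.ffill().tolist())      # [1.0, 1.0, 1.0, 3.0]
print(ratings.bfill().tolist())      # [1.0, 3.0, 3.0, 3.0]
```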

To get the rows with NaNs, instead of hard-coding the integer indices (which would work with both .loc[ ] and .iloc[ ] here, because the index is the row number as well), I used the saved NaN indices together with the indices of the rows right before them. This is mainly to avoid writing them manually, since the Bechdel data website gets updated frequently, which would misalign the hard-coded values and the real positions of the NaNs across updates.

That is one of the most basic ways to fill NaNs, but you can get fancier if you need to. Imagine, for example, you wanted to replace the NaNs with a rolling average (averaging over the nearest data points instead of the whole column). You could do something like this:

In [34]:
# rolling needs numbers, so convert the string ids first
pd.to_numeric(data['imdbid']).rolling(window=6, min_periods=1).mean()

0       3.927280e+05
1       1.963655e+05
2       1.749550e+05
3       1.312198e+05
4       1.050020e+05
            ...     
8834    7.956427e+06
8835    9.595310e+06
8836    9.883262e+06
8837    8.962566e+06
8838    8.601042e+06
Name: imdbid, Length: 8839, dtype: float64

That last piece of code can be extremely useful. Essentially, we ask for a sliding window of size 6 (the number of elements that are considered at one time) and set the minimum number of non-NaN elements the window must contain to produce a value (useful if you have many NaNs one after the other, for example). On it we then simply apply an aggregating function, in this case the mean. As always, I encourage you to check the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html) to learn more about what the rolling function can do.

And, to replace the actual values with that rolling mean, you could do something like:

imdb_numeric = pd.to_numeric(data['imdbid'])  # rolling needs numbers, not strings
data['imdbid'] = imdb_numeric.fillna(imdb_numeric.rolling(window=6, min_periods=1).mean())
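

On a small made-up Series the arithmetic can be followed by hand. This is only a sketch with a window of 3 instead of 6 to keep it short:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Mean over the current value and the two before it; NaNs in the window are skipped
rolling_mean = values.rolling(window=3, min_periods=1).mean()
print(rolling_mean.tolist())  # [1.0, 1.5, 1.5, 3.0, 4.5]

# Only the missing position takes its value from the rolling mean
filled = values.fillna(rolling_mean)
print(filled.tolist())        # [1.0, 2.0, 1.5, 4.0, 5.0]
```

With the default window alignment, each mean only looks backwards, so the gap at position 2 is filled from the two values before it.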

None of these solutions are good for us here though, since one can't really calculate a given movie's imdb id from other movies' imdb ids. So in our case, to replace the NaNs we will have to look them up manually. Before doing that though, let's look again at the movies missing imdb ids:

In [35]:
data.loc[nan_idx, :]

Unnamed: 0,year,imdbid,rating,title,id
7602,2015,,3,"Danish Girl , The",9081
8208,2017,,3,Wonder Woman,9294
8280,2017,,3,Wonder Woman,9293
8545,2019,,3,"Rise Of Skywalker, The",9098


You'll see that Wonder Woman (2017) appears there twice. That's a great opportunity to explore the second most common data cleaning process: dealing with duplicates.

###  Dealing with Duplicates
#### .duplicated, .drop, .reset_index
First, it is good to know if there are any more duplicates:

In [36]:
data[data.duplicated(subset=['year', 'title'], keep=False)]

Unnamed: 0,year,imdbid,rating,title,id
182,1931,21814.0,2,Dracula,1985
192,1931,21815.0,3,Dracula,8213
839,1959,53285.0,3,Sleeping Beauty,9209
848,1959,53285.0,3,Sleeping Beauty,474
1783,1983,86425.0,3,Terms of Endearment,4449
1792,1983,86425.0,1,Terms of Endearment,4448
2915,1997,117056.0,3,Ayneh,4380
2998,1997,117056.0,3,Ayneh,4381
6010,2011,2043900.0,3,Last Call at the Oasis,4889
6092,2011,2043900.0,3,Last Call at the Oasis,4907


As with *.isna()*, the *.duplicated* function returns a Series of booleans specifying whether each row is a duplicate or not. The *subset* parameter lets us consider only certain columns when looking for repeats. In our case, we asked it to treat as a duplicate anything that shares the same title AND year. That way, we don't pick up reboots or unrelated movies that happen to share a title. The *keep* parameter determines which copies get flagged. With False, we ask it to flag every copy. With 'first' or 'last', the first or last occurrence respectively is left unflagged and only the other copies are marked as duplicates.
 
If we didn't care much about the result above, we could quickly remove duplicates using the aptly named function *drop_duplicates*

In [37]:
data.drop_duplicates(subset=['year', 'title'], keep='last')

Unnamed: 0,year,imdbid,rating,title,id
0,1888,0392728,0,Roundhay Garden Scene,8040
1,1892,0000003,0,Pauvre Pierrot,5433
2,1895,0132134,0,"Execution of Mary, Queen of Scots, The",6200
3,1895,0000014,0,Tables Turned on the Gardener,5444
4,1896,0000131,0,Une nuit terrible,5406
...,...,...,...,...,...
8834,2021,5144174,3,"Dry, The",9498
8835,2021,10919362,3,Sweetheart,9505
8836,2021,10813940,2,Ginny and Georgia,9501
8837,2021,5109280,3,Raya and the Last Dragon,9504


That function is not very different from *.dropna* before and *.duplicated* right above. We essentially asked to consider only the year and title columns (instead of the whole row) and in this case, we are keeping the last of all the duplicate sets. If we wanted to save the dataset with dropped duplicates, we could use the *inplace* parameter or reassign with data = data.drop_duplicates(...).
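
The effect of *keep* is easiest to see on a handful of rows. A minimal sketch, with a made-up table loosely based on the entries above:

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1931, 1931, 1959, 2017],
    'title': ['Dracula', 'Dracula', 'Sleeping Beauty', 'Wonder Woman'],
    'id': [1985, 8213, 474, 9294],
})
cols = ['year', 'title']

print(toy.duplicated(subset=cols, keep=False).tolist())    # [True, True, False, False]
print(toy.duplicated(subset=cols, keep='first').tolist())  # [False, True, False, False]
print(toy.duplicated(subset=cols, keep='last').tolist())   # [True, False, False, False]

# drop_duplicates removes exactly the rows that duplicated() flags with the same arguments
kept = toy.drop_duplicates(subset=cols, keep='last')
print(kept['id'].tolist())                                 # [8213, 474, 9294]
```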

It was good to learn to use the *drop_duplicates* method. However, at least at the time that I wrote this, it was more useful to manually inspect all copies of the duplicated entries to spot some important things to clean. Of course, in other datasets this might not be possible, but since we can do it here, let's take the time:

In [38]:
data[data.duplicated(subset=['year', 'title'], keep=False)]

Unnamed: 0,year,imdbid,rating,title,id
182,1931,21814.0,2,Dracula,1985
192,1931,21815.0,3,Dracula,8213
839,1959,53285.0,3,Sleeping Beauty,9209
848,1959,53285.0,3,Sleeping Beauty,474
1783,1983,86425.0,3,Terms of Endearment,4449
1792,1983,86425.0,1,Terms of Endearment,4448
2915,1997,117056.0,3,Ayneh,4380
2998,1997,117056.0,3,Ayneh,4381
6010,2011,2043900.0,3,Last Call at the Oasis,4889
6092,2011,2043900.0,3,Last Call at the Oasis,4907


1. The Dracula (1931) movie entries don't share their imdb ids.
    * Are they two different Dracula movies from 1931? Not likely, need to check the imdb id manually even if not a NaN.
2. One of the Sleeping Beauty (1959) imdb ids is missing a couple of zeros (or the other has extra zeros)
    * Check id manually to decide which one to keep
3. Terms of Endearment (1983) shares the ids but not the bechdel rating
    * Watching the movie and deciding the rating would be the best thing to do. A quick wiki search though tells us the movie covers 30 years of the relationship between a mother and a daughter so I'll assume that the rating of 1 can be dropped
4. Ayneh (1997) and Last Call at the Oasis (2011) seem to be ok. Different ids in their database, so true duplicates here
    * Drop any of the two
5. Into the Woods (2014) also missing (or extra) zeros on the imdbid
    * Check id manually to decide which one to keep
6. Wonder Woman (2017) has 3 entries, two with NaNs for the imdbid and one with an id
    * Worth checking the imdb id present and drop the NaN rows
    
Remember again, if you are trying to replicate these results at a later date, these might be fixed already and/or new ones might appear. Deal with them at your own discretion.
    
Having performed all of these manual checks, we decide to drop the following rows (by row index). Be mindful that if you run the next block of code more than once it will throw an error, because the rows will already have been dropped by the second run.

1. Drop the one whose imdb id is not 0021814
2. Drop the one without the double 00s
3. Drop the one with a rating of 1
4. Drop either of the two
5. Drop the one with extra zeros
6. Drop the two rows with NaN imdb ids

In [39]:
to_drop = [192, 848, 1792, 2998, 6010, 7238, 8208, 8280]
data = data.drop(to_drop)

The *.drop* function not only serves to drop rows by index, but also columns by name (if the axis is specified) and has other useful options worth reading in the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

Since we dropped some rows, it would be clean to also reset the index (so it doesn't skip any numbers as in ..., 181, 183, ...). For that we use the *reset_index* function:

In [40]:
data = data.reset_index(drop=True)
data

Unnamed: 0,year,imdbid,rating,title,id
0,1888,0392728,0,Roundhay Garden Scene,8040
1,1892,0000003,0,Pauvre Pierrot,5433
2,1895,0132134,0,"Execution of Mary, Queen of Scots, The",6200
3,1895,0000014,0,Tables Turned on the Gardener,5444
4,1896,0000131,0,Une nuit terrible,5406
...,...,...,...,...,...
8826,2021,5144174,3,"Dry, The",9498
8827,2021,10919362,3,Sweetheart,9505
8828,2021,10813940,2,Ginny and Georgia,9501
8829,2021,5109280,3,Raya and the Last Dragon,9504


The *drop=True* there serves to completely discard the old index, instead of keeping it as a new column before resetting the index.
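
A compact sketch of these options on a made-up table. It also shows *errors='ignore'*, a *.drop* option that avoids the error mentioned earlier when a label has already been dropped:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Dracula', 'Dracula', 'Ayneh'],
    'rating': [2, 3, 3],
})

rows_gone = toy.drop([1])                          # drop by index label
same_again = rows_gone.drop([1], errors='ignore')  # label already gone: no error raised
cols_gone = toy.drop(columns=['rating'])           # same as toy.drop(['rating'], axis=1)

print(rows_gone.index.tolist())                         # [0, 2]
print(rows_gone.reset_index(drop=True).index.tolist())  # [0, 1]
print(rows_gone.reset_index().columns.tolist())         # ['index', 'title', 'rating']
```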

### Aside - Finish filling NaNs
Now that we have dealt with the duplicates, we can go back to filling those NaN imdb ids, which we left before because of the "Wonder Woman (2017)" double entry. A quick search for the two movies shouldn't be hard (apparently the id is in the url of the page itself when you look up the movie on imdb):

In [41]:
nan_idx = data.loc[data.imdbid.isna()].index
data.loc[nan_idx]

Unnamed: 0,year,imdbid,rating,title,id
7596,2015,,3,"Danish Girl , The",9081
8537,2019,,3,"Rise Of Skywalker, The",9098


This allows me to present another way of using the fillna function, which is to provide the index and the new value as a dictionary:

In [42]:
missing_ids = {nan_idx[0]:'0810819', nan_idx[1]:'2527338'}
data.imdbid = data.imdbid.fillna(missing_ids)
data.loc[nan_idx, :]

Unnamed: 0,year,imdbid,rating,title,id
7596,2015,810819,3,"Danish Girl , The",9081
8537,2019,2527338,3,"Rise Of Skywalker, The",9098
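

To isolate what the dictionary form does, here is a sketch on a small made-up Series. The replacement ids are placeholders invented for the example, not real imdb ids:

```python
import numpy as np
import pandas as pd

ids = pd.Series(['0392728', np.nan, '0000014', np.nan], index=[10, 11, 12, 13])

# Keys are index labels, values are the replacements (placeholder ids)
filled = ids.fillna({11: '1111111', 13: '3333333'})
print(filled.tolist())

# Entries that are not NaN are left alone, even if their label is in the dictionary
untouched = ids.fillna({10: 'ignored'})
print(untouched[10])
```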


##  String columns - Processing badly written text
#### .str, .str.replace

Another issue you might have noticed so far is the appearance of a weird series of characters (\&#39;) where apostrophes (') should've been. This is a cool opportunity to explore another pandas functionality that comes in handy often: *.str*.

*.str* is a good way to work with strings in pandas Series (the 1D Dataframe if I haven't mentioned it already). Let's take the example of correcting the weird sequence of characters. Let's first identify the rows that have it with *str.contains*

In [43]:
seq = '&#39;'
idx = data[data.title.str.contains(seq)].index
data.title[idx]

11      Astronomer&#39;s Dream; or, The Man in the Moo...
15                         Hamlet ( Le Duel d&#39;Hamlet)
24                            Grandma&#39;s Reading Glass
26                    L&#39;homme a la tete en caoutchouc
29                    Mephistopheles&#39; School of Magic
                              ...                        
8704                       Where&#39;d You Go, Bernadette
8744         Mariah Carey&#39;s Magical Christmas Special
8775                                   I&#39;m Your Woman
8776                         Ma Rainey&#39;s Black Bottom
8825                                   Finding &#39;Ohana
Name: title, Length: 392, dtype: object

That is quite a lot of them, but we can quickly take care of them with *.str.replace*.

In [44]:
data.title = data.title.str.replace(seq, "'")
data.title[idx]

11      Astronomer's Dream; or, The Man in the Moon, The
15                            Hamlet ( Le Duel d'Hamlet)
24                               Grandma's Reading Glass
26                       L'homme a la tete en caoutchouc
29                       Mephistopheles' School of Magic
                              ...                       
8704                          Where'd You Go, Bernadette
8744            Mariah Carey's Magical Christmas Special
8775                                      I'm Your Woman
8776                            Ma Rainey's Black Bottom
8825                                      Finding 'Ohana
Name: title, Length: 392, dtype: object

That seems to be all we need to do for the title column in terms of *.str*. That being said, there is tons more functionality packed into *.str*, so I strongly advise that whenever you need vectorized (*i.e.* fast, non-loopy) string operations on pandas columns, you think of *.str* first!
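
As a recap of the *.str* calls, plus one possible alternative, here is a sketch on three titles taken from the outputs above. The alternative assumes the odd sequences are HTML entities (which \&#39; and \&eacute; look like), in which case Python's built-in *html.unescape* can decode all of them in one pass. That is my assumption about the data, not something the website documents:

```python
import html

import pandas as pd

titles = pd.Series(['Grandma&#39;s Reading Glass', 'La loi du march&eacute;', 'Brave'])

has_entity = titles.str.contains('&#39;', regex=False)  # literal match, no regex
print(has_entity.tolist())                              # [True, False, False]

fixed = titles.str.replace('&#39;', "'", regex=False)
print(fixed[0])                                         # Grandma's Reading Glass

# Alternative: decode every HTML entity at once instead of one sequence at a time
decoded = titles.map(html.unescape)
print(decoded.tolist())
```

The trade-off is that *.str.replace* touches only the sequence you name, while *html.unescape* changes anything that parses as an entity, so it is worth inspecting the result before overwriting the column.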

That is it for the cleaning side of things. Hopefully by now we not only have a good clean dataset to work with, but you have also learned useful tools to perform the same type of process on your own datasets, whatever they might be. The last thing left to do at this point is to save a copy of the clean data, so we can directly access it when analysing the data without having to go through all of this trouble again:

In [45]:
data.to_csv(bechdel_clean_path, index=False, sep='\t', compression='gzip')