# Creating and Importing Modules

The functions in `data_preparation.py` can be imported into this notebook and used just like pandas, NumPy, or Matplotlib. The `__init__.py` file in the `tools` directory indicates that the directory is a package containing importable `.py` files. There is also an `__init__.py` file in the main repository directory, which allows all files within that directory to import from the `tools` package.
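
As a rough illustration of the package mechanics described above, the sketch below builds a throwaway package in a temporary directory and imports from it. The names `demo_tools`, `GREETING`, and `shout` are hypothetical stand-ins and are not part of this repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway layout: <tmp>/demo_tools/{__init__.py, data_preparation.py}
# (`demo_tools` is a hypothetical stand-in for the real `tools` package).
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "demo_tools"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # empty file marks the directory as a package
(package_dir / "data_preparation.py").write_text(
    "GREETING = 'hello from demo_tools'\n"
    "def shout(text):\n"
    "    return text.upper()\n"
)

# The directory that contains the package has to be on the import path.
sys.path.insert(0, str(project_root))
importlib.invalidate_caches()

from demo_tools.data_preparation import GREETING, shout

print(shout(GREETING))
```

The import only succeeds because the parent directory of the package is on `sys.path`; a notebook stored somewhere else would need the same adjustment.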

In [1]:
import pandas as pd
from data_preparation import *

By using `from data_preparation import *`, all functions and constants defined in the module become accessible in this notebook without the module name or an alias. This can cause problems if function or class names overlap with those of other imported modules, but it should be fine here as long as the functions have unique, descriptive names. I have defined a few example functions in the `data_preparation.py` file within the `tools` directory.
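
To make the name-overlap risk concrete, here is a minimal sketch that uses two throwaway modules. The names `prep_a`, `prep_b`, and `clean` are hypothetical and exist only for this demonstration; `exec` is used solely so that both star imports run in an isolated namespace.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two throwaway modules (hypothetical names) that both define `clean`.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "prep_a.py").write_text("def clean():\n    return 'prep_a.clean'\n")
(module_dir / "prep_b.py").write_text("def clean():\n    return 'prep_b.clean'\n")
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# Run both star imports in a scratch namespace, as two notebook lines would.
scratch = {}
exec("from prep_a import *\nfrom prep_b import *", scratch)
star_result = scratch["clean"]()  # the later import silently replaced the earlier one

# Importing the modules by name keeps both functions reachable.
import prep_a
import prep_b

qualified_results = (prep_a.clean(), prep_b.clean())
print(star_result)
print(qualified_results)
```

With the star imports, only the `clean` from the module imported last survives, and no warning is raised. Qualified names such as `prep_a.clean()` avoid the collision at the cost of a little extra typing.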
___
First, an example of cleaning the Rotten Tomatoes review data without modularization:

In [2]:
# read Rotten Tomatoes reviews csv file
reviews_df = pd.read_csv("../data/rt.reviews.tsv", delimiter='\t', encoding='latin-1')
reviews_df.head()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


In [3]:
# Drop rows that are exact duplicates across all columns
reviews_df.drop_duplicates(inplace=True)

In [4]:
# Drop unnecessary columns
reviews_df.drop(['rating', 'publisher', 'critic'], axis=1, inplace=True)

In [5]:
# Convert the date column to pandas datetime (datetime64[ns])
reviews_df['date'] = pd.to_datetime(reviews_df['date'])

In [6]:
# Change 'fresh' column to 1 if fresh, 0 if rotten
reviews_df['fresh'] = reviews_df['fresh'].map({'rotten': 0, 'fresh': 1})

In [7]:
reviews_df

Unnamed: 0,id,review,fresh,top_critic,date
0,3,A distinctly gallows take on contemporary fina...,1,0,2018-11-10
1,3,It's an allegory in search of a meaning that n...,0,0,2018-05-23
2,3,... life lived in a bubble in financial dealin...,1,0,2018-01-04
3,3,Continuing along a line introduced in last yea...,1,0,2017-11-16
4,3,... a perverse twist on neorealism...,1,0,2017-10-12
...,...,...,...,...,...
54427,2000,The real charm of this trifle is the deadpan c...,1,1,2002-09-24
54428,2000,,0,0,2005-09-21
54429,2000,,0,0,2005-07-17
54430,2000,,0,0,2003-09-07


In [10]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54423 entries, 0 to 54431
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   id          54423 non-null  int64         
 1   review      48867 non-null  object        
 2   fresh       54423 non-null  int64         
 3   top_critic  54423 non-null  int64         
 4   date        54423 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 2.5+ MB


All of these lines of code are stored in `data_preparation.py` within the `clean_rt_reviews` function, which takes the file path as a parameter and returns the cleaned pandas DataFrame. The file paths are also defined in that file as global constants, which are accessible after importing.
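
For reference, a function that wraps the cells above might look roughly like the sketch below. This is an assumption about how `clean_rt_reviews` could be written, not a copy of the actual contents of `data_preparation.py`. The value given to `RT_REVIEWS_PATH` is likewise assumed from the path used earlier in this notebook.

```python
import pandas as pd

# Assumed value: the same relative path that the notebook cell above reads from.
RT_REVIEWS_PATH = "../data/rt.reviews.tsv"


def clean_rt_reviews(path):
    """Read the Rotten Tomatoes reviews file and return a cleaned DataFrame.

    A sketch that mirrors the notebook cells step by step; the real
    function in data_preparation.py may differ.
    """
    reviews = pd.read_csv(path, delimiter="\t", encoding="latin-1")

    # Drop rows that are exact duplicates across all columns
    reviews = reviews.drop_duplicates()

    # Drop columns that are not needed downstream
    reviews = reviews.drop(["rating", "publisher", "critic"], axis=1)

    # Convert the date strings to pandas datetimes
    reviews["date"] = pd.to_datetime(reviews["date"])

    # Encode freshness as 1 (fresh) or 0 (rotten)
    reviews["fresh"] = reviews["fresh"].map({"rotten": 0, "fresh": 1})

    return reviews
```

Returning a new DataFrame instead of using `inplace=True` is a stylistic choice in this sketch; the resulting rows and columns are the same as in the step-by-step version.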


Now, using the function:
___

In [8]:
df_with_function = clean_rt_reviews(RT_REVIEWS_PATH)

In [9]:
df_with_function

Unnamed: 0,id,review,fresh,top_critic,date
0,3,A distinctly gallows take on contemporary fina...,1,0,2018-11-10
1,3,It's an allegory in search of a meaning that n...,0,0,2018-05-23
2,3,... life lived in a bubble in financial dealin...,1,0,2018-01-04
3,3,Continuing along a line introduced in last yea...,1,0,2017-11-16
4,3,... a perverse twist on neorealism...,1,0,2017-10-12
...,...,...,...,...,...
54427,2000,The real charm of this trifle is the deadpan c...,1,1,2002-09-24
54428,2000,,0,0,2005-09-21
54429,2000,,0,0,2005-07-17
54430,2000,,0,0,2003-09-07


In [11]:
df_with_function.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54423 entries, 0 to 54431
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   id          54423 non-null  int64         
 1   review      48867 non-null  object        
 2   fresh       54423 non-null  int64         
 3   top_critic  54423 non-null  int64         
 4   date        54423 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 2.5+ MB
