## Preprocessing Assignment
1. [X] Fetch data from the API and create a dataset of movie title, overview, and genres
2. [X] Apply suitable preprocessing to the dataset

### 1. Fetching data from API and creating dataset

In [None]:
%%time
import json
import requests
from tqdm import tqdm
import pandas as pd

# collect one dict per movie; the DataFrame is built once after the loop
rows = []

# url of movies api
url = "https://api.themoviedb.org/3/movie/top_rated?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US&page="

# fetch pages 1 to 470
for page in tqdm(range(1, 471)):
  # GET request to api url
  response = requests.get(f'{url}{page}')
  # stop on HTTP errors instead of parsing an error body
  response.raise_for_status()

  # parse the response as JSON
  data = json.loads(response.text)

  # get results from JSON
  results = data["results"]

  for result in results:
    # extract title, overview, and genre ids from the result
    rows.append({"title": result["title"], "overview": result["overview"], "genre_ids": result["genre_ids"]})

# build the DataFrame in one step; pd.concat inside the loop copies the whole frame for every row
df = pd.DataFrame(rows, columns=['title', 'overview', 'genre_ids'])

In [None]:
df.head()
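
The fetch step above comes down to flattening paged JSON payloads into one row per movie. The sketch below shows that flattening offline on two hand-written pages. The sample payloads and the `pages_to_rows` helper are hypothetical stand-ins for illustration, not real API output.

```python
# Hypothetical sample pages shaped like the "results" payload used above
sample_pages = [
    {"results": [
        {"title": "Movie A", "overview": "First overview.", "genre_ids": [18, 80]},
        {"title": "Movie B", "overview": "Second overview.", "genre_ids": [35]},
    ]},
    {"results": [
        {"title": "Movie C", "overview": "", "genre_ids": []},
    ]},
]

def pages_to_rows(pages):
    """Flatten paged payloads into one list of row dicts."""
    rows = []
    for page in pages:
        for result in page["results"]:
            rows.append({
                "title": result["title"],
                "overview": result["overview"],
                "genre_ids": result["genre_ids"],
            })
    return rows

rows = pages_to_rows(sample_pages)
print(len(rows))
```

A list built this way can be handed to `pd.DataFrame(rows)` in a single call, which avoids growing a DataFrame row by row.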

#### Replace genre_ids with genre names

In [None]:
# URL for the list of genre ids and their names
url = "https://api.themoviedb.org/3/genre/movie/list?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US"

response = requests.get(url)

data = json.loads(response.text)

results_dict = data["genres"]
# print(results_dict)

In [None]:
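# function to return comma-separated genre names for a list of genre ids
def get_genre_name(genre_ids, results_dict):
  # map each genre id to its name, then look the ids up directly
  id_to_name = {genre["id"]: genre["name"] for genre in results_dict}
  # ids that are missing from the genre list are skipped
  return ','.join(id_to_name[genre_id] for genre_id in genre_ids if genre_id in id_to_name)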


In [None]:
%%time
df['genres'] = df['genre_ids'].apply(lambda ids: get_genre_name(ids, results_dict))
df.drop('genre_ids', axis=1, inplace=True)
df.head()
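
The id-to-name replacement above can also be tried on its own with a small lookup table. This is a minimal sketch: `sample_genres` is a hand-written excerpt in the same shape as `results_dict`, and `genre_names` is a hypothetical helper name.

```python
# Hypothetical excerpt of the genre list (same shape as results_dict)
sample_genres = [
    {"id": 18, "name": "Drama"},
    {"id": 80, "name": "Crime"},
    {"id": 35, "name": "Comedy"},
]

# Build the lookup once so each id is resolved without scanning the list
id_to_name = {g["id"]: g["name"] for g in sample_genres}

def genre_names(genre_ids, lookup):
    """Join the names of known genre ids; unknown ids are skipped."""
    return ",".join(lookup[i] for i in genre_ids if i in lookup)

joined = genre_names([18, 80, 999], id_to_name)
print(joined)
```

The id `999` is not in the lookup, so it is dropped silently. Whether to skip or flag unknown ids is a design choice that depends on the end goal.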

In [None]:
# save df to csv for later use
df.to_csv("./data/movies.csv", index = False)

### 2. Preprocessing

Which preprocessing operations are needed depends on the objective of the task at hand. Since the dataset is built from API responses, the data is already fairly clean, so many preprocessing tasks such as HTML tag removal, URL removal, chat word treatment, spelling correction, and emoji handling are not required here.

Preprocessing techniques generally needed for this dataset (subject to the end goal) are:
- [x] Lowercasing
- [x] Punctuation Removal
- [x] Stop Word Removal
- [x] Tokenisation
- [x] Stemming
- [x] Lemmatization
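
As a rough picture of how the first four steps in the list chain together, here is a small sketch in plain Python. The stop word set is a tiny hand-picked assumption (NLTK's English list is much longer), and `preprocess` is a hypothetical helper name.

```python
import string

# Tiny illustrative stop word set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and return the tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = preprocess("The Godfather is a story of power, family, and loyalty.")
print(tokens)
```

Stemming or lemmatization would then run on the returned tokens.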

In [None]:
# read saved csv
df = pd.read_csv("./data/movies.csv")
df.head()

#### Lowercasing

In [None]:
df['overview'] = df['overview'].str.lower()
df.head()

#### Punctuation removal
**Caution:** Once punctuation is removed, sentence boundaries are lost and each overview becomes one long sentence, so sentence tokenisation can no longer be applied afterwards.

In [None]:
import string
exclude = string.punctuation

def remove_punc_fast(text):
    return text.translate(str.maketrans('', '', exclude))

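# empty overviews are read back from the CSV as NaN; astype(str) would turn them into the string "nan"
df['overview'] = df['overview'].fillna('').apply(remove_punc_fast)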
df['overview'][0]
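
The caution about sentence tokenisation can be seen in a small experiment. The sketch below uses a naive regex splitter as a stand-in for a real sentence tokenizer (an assumption; it would mis-split abbreviations such as "Dr."), and the overview text is hand-written.

```python
import re
import string

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def remove_punc(text):
    return text.translate(str.maketrans("", "", string.punctuation))

overview = "A banker is sent to prison. He befriends a fellow inmate."

# Split first, then strip punctuation from each sentence
sentences = [remove_punc(s) for s in split_sentences(overview)]

# Stripping first leaves no boundary for the splitter to find
merged = split_sentences(remove_punc(overview))

print(len(sentences), len(merged))
```

If sentence-level features are needed later, splitting into sentences before removing punctuation keeps that option open.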

#### Stop word removal

In [None]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# build the stop word set once; calling stopwords.words('english') for every word is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not stop words
    return ' '.join(word for word in text.split() if word not in stop_words)

df['overview'] = df['overview'].astype(str).apply(remove_stopwords)
df.head()

#### Tokenization
- Only word tokenisation is applied, since sentence boundaries were lost when punctuation was removed.

In [None]:
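import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

df['token_words'] = df['overview'].apply(word_tokenize)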
df.head()

#### Porter Stemming

In [None]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

df['stem_words'] = df['overview'].astype(str).apply(stem_words)
df.head()

#### Lemmatisation

In [None]:
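import nltk
from nltk.stem import WordNetLemmatizer

# the lemmatizer needs the WordNet corpus
nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize_words(words):
    # pos='v' lemmatizes every token as a verb (e.g. 'running' -> 'run');
    # other parts of speech may be left unchanged
    return " ".join(wordnet_lemmatizer.lemmatize(word, pos='v') for word in words)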

df['lemma_words'] = df['token_words'].apply(lemmatize_words)
df.head()

In [None]:
# keep the rows where the stemmed text differs from the lemmatized text
diff_df = df[df['stem_words'] != df['lemma_words']]
diff_df[['stem_words', 'lemma_words']]
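
The row-level comparison above only says that two strings differ somewhere. To see which words differ, the two strings can be compared token by token. This is a minimal sketch: `differing_tokens` is a hypothetical helper, the example strings are hand-written rather than taken from the dataset, and it assumes both strings contain the same number of tokens.

```python
def differing_tokens(stemmed, lemmatized):
    """Return (stem, lemma) pairs that differ at the same position."""
    return [
        (s, l)
        for s, l in zip(stemmed.split(), lemmatized.split())
        if s != l
    ]

# Hypothetical example strings, hand-written for illustration
stem_text = "famili movi about lost dog"
lemma_text = "family movie about lose dog"

pairs = differing_tokens(stem_text, lemma_text)
print(pairs)
```

Applied to a row of `diff_df`, the call would be `differing_tokens(row['stem_words'], row['lemma_words'])`.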

### Inapplicable Preprocessing Operations
- HTML tag removal
- URL removal
- Chat word treatment
- Spelling correction
- Emoji handling

These preprocessing techniques are not needed for this dataset, but they are applied below for learning purposes.

In [None]:
# read saved csv
df = pd.read_csv("./data/movies.csv")
df.head()

In [None]:
# HTML tags removal
import re

def remove_html(text):
    return re.sub(r'<.*?>', '', text)

# fill missing overviews with an empty string so they do not become the string "nan"
df['overview'] = df['overview'].fillna('').apply(remove_html)

In [None]:
# URL removal
def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

df['overview'] = df['overview'].apply(remove_url)
df.head()
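
Because the overviews contain little or no HTML or URLs, the two functions above are easier to judge on a made-up noisy string. In this sketch the input text and the `example.com` URL are placeholders.

```python
import re

def remove_html(text):
    return re.sub(r"<.*?>", "", text)

def remove_url(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

# Hypothetical noisy text; the URL is a placeholder
raw = "<p>Read the <b>full</b> review at https://example.com/review now</p>"

cleaned = remove_url(remove_html(raw))
# Collapse the double space left behind where the URL used to be
cleaned = " ".join(cleaned.split())
print(cleaned)
```

Tags are removed before URLs here because `\S+` would otherwise swallow a closing tag that directly follows a URL.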

In [None]:
# Chat word treatment

# create dict of slang
chat_words = """AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you (also a chat program)
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My A.. Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My A.. Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The F...
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait...
7K=Sick:-D Laugher
TFW = That feeling when. TFW internet slang often goes in a caption to an image.
MFW = My face when
MRW = My reaction when
IFYP = I feel your pain
LOL = Laughing out loud
TNTL = Trying not to laugh
JK = Just kidding
IDC = I don’t care
ILY = I love you
IMU = I miss you
ADIH = Another day in hell
IDC = I don’t care
ZZZ = Sleeping, bored, tired
WYWH = Wish you were here
TIME = Tears in my eyes
BAE = Before anyone else
FIMH = Forever in my heart
BSAAW = Big smile and a wink
BWL = Bursting with laughter
LMAO = Laughing my a** off
BFF = Best friends forever
CSL = Can’t stop laughing"""

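word_pairs = chat_words.split("\n")

chat_dict = {}
for pair in word_pairs:
    # split on the first '=' only and strip spaces, so "TFW = ..." is stored under the key "TFW"
    key, value = pair.split('=', 1)
    chat_dict[key.strip()] = value.strip()
# chat_dict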

In [None]:
# Note: matching is case-insensitive, so ordinary words such as "kiss" or "u" are expanded as well
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_dict:
            new_text.append(chat_dict[w.upper()])
        else:
            new_text.append(w)
    return ' '.join(new_text)

df['overview'] = df['overview'].astype(str).apply(chat_conversion)
df.head()
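
A stricter variant of the chat word treatment is sketched below. It differs from the cell above in one respect: only tokens written entirely in upper case are expanded, which avoids rewriting ordinary lowercase words. The short slang excerpt and the `expand_chat_words` name are assumptions for illustration.

```python
# Hypothetical excerpt of the slang list, including an entry with spaces around '='
chat_words = """BRB=Be Right Back
FYI=For Your Information
JK = Just kidding"""

chat_dict = {}
for line in chat_words.splitlines():
    key, value = line.split("=", 1)  # split only on the first '='
    chat_dict[key.strip()] = value.strip()

def expand_chat_words(text):
    """Expand tokens that are written in upper case and found in chat_dict."""
    return " ".join(
        chat_dict.get(w, w) if w.isupper() else w
        for w in text.split()
    )

expanded = expand_chat_words("BRB grabbing popcorn, jk")
print(expanded)
```

The trade-off is that lowercase slang such as `jk` is left alone, so this variant only suits text where abbreviations are typed in capitals.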

In [None]:
# Spelling Correction (TextBlob's correct() is slow on thousands of rows and may alter proper names)
from textblob import TextBlob

def correct_spell(text):
    text_blob = TextBlob(text)
    return text_blob.correct().string

df['overview'] = df['overview'].astype(str).apply(correct_spell)
df.head()

In [None]:
# replace each emoji with its text name (e.g. ":thumbs_up:") instead of deleting it, which keeps its meaning
import emoji

df['overview'] = df['overview'].apply(emoji.demojize)
df.head()
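
To illustrate the idea of replacing an emoji with a name without installing the `emoji` package, the sketch below uses the standard library's Unicode database instead. This is a swapped-in technique, not what `emoji.demojize` does internally: it only handles single code points in the "Symbol, other" category and ignores multi-character emoji sequences. `describe_symbols` is a hypothetical helper name.

```python
import unicodedata

def describe_symbols(text):
    """Replace 'Symbol, other' characters (which include many emoji) with :name:."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown").lower().replace(" ", "_")
            out.append(f":{name}:")
        else:
            out.append(ch)
    return "".join(out)

# \U0001F600 is the grinning face emoji
described = describe_symbols("Great movie \U0001F600")
print(described)
```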

#### Errors:
- An SSL error occurred in Jupyter Lab while fetching from the API:

    **SSLEOFError: [SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)**
  
    As a workaround, the data creation part of the code was run in `Colab` and the resulting CSV file was downloaded.
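
If an error like this turns out to be transient, retrying the request a few times is one possible alternative to switching environments. The cause of the error above is not known, so this is only a sketch under that assumption. The `retry` helper and the simulated `flaky_fetch` function are hypothetical and use no network access.

```python
import time

def retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func(); on a listed exception wait and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            # wait a little longer after each failed attempt
            time.sleep(delay * attempt)

# Hypothetical flaky call that fails twice before succeeding
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped connection")
    return {"results": []}

data = retry(flaky_fetch, attempts=3, exceptions=(ConnectionError,))
print(calls["count"])
```

With `requests`, the wrapped call could be `lambda: requests.get(page_url)` with `exceptions=(requests.exceptions.SSLError,)`, where `page_url` stands for the page being fetched.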