# Assignment: Collecting and Cleaning Movie Data from TMDB API

## Objective
Collect movie data from The Movie Database (TMDB) API and clean the text data for Natural Language Processing (NLP).

## Task Overview
1. **Data Collection:**
   - Collect movie data (title, overview, and genre) from the TMDB API for pages 1 to 471.
     - API Endpoint: `https://api.themoviedb.org/3/movie/top_rated?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US&page=1`
   - Collect genre data using the genre API endpoint.
     - API Endpoint: `https://api.themoviedb.org/3/genre/movie/list?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US`

2. **Data Cleaning:**
   - Clean the collected text data to prepare it for NLP tasks.

## Detailed Steps

### Step 1: Data Collection
1. **Set Up:**
   - Use Python to interact with the TMDB API.
   - Install the required libraries:
     ```bash
     pip install requests pandas
     ```

2. **Collect Movie Data:**
   - Use the provided URL to access the top-rated movies. Loop through pages 1 to 471 to collect all movie data.
   - Extract the movie title, overview, and genre IDs.

3. **Collect Genre Data:**
   - Use the provided genre API URL to get the genre IDs and names.

4. **Store Data:**
   - Store the collected data in a structured format, such as a CSV file or a Pandas DataFrame.
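
The collection steps above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: the names `fetch_page` and `collect_movies` are hypothetical, the 0.25-second pause is an assumed value rather than a documented TMDB limit, and the API key is passed in as a parameter instead of being hard-coded.

```python
import time


def fetch_page(page, api_key):
    """Fetch one page of top-rated movies and return the parsed JSON."""
    # Imported here so collect_movies below can be tried without
    # `requests` installed or any network access.
    import requests

    response = requests.get(
        "https://api.themoviedb.org/3/movie/top_rated",
        params={"api_key": api_key, "language": "en-US", "page": page},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def collect_movies(fetch, first_page=1, last_page=471, delay=0.25):
    """Call fetch(page) for each page and keep title, overview and genre_ids."""
    rows = []
    for page in range(first_page, last_page + 1):
        payload = fetch(page)
        for movie in payload.get("results", []):
            rows.append(
                {
                    "title": movie.get("title"),
                    "overview": movie.get("overview"),
                    "genre_ids": movie.get("genre_ids", []),
                }
            )
        time.sleep(delay)  # assumed pause between requests; tune as needed
    return rows
```

With a key stored in a variable such as `API_KEY`, `collect_movies(lambda p: fetch_page(p, API_KEY))` would return a list of dictionaries that `pd.DataFrame` accepts directly. Passing the fetch function in as an argument keeps the loop testable with a stub that returns canned pages.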

### Step 2: Data Cleaning
1. **Text Cleaning:**
   - Remove punctuation, special characters, and numbers from the movie titles and overviews.
   - Convert all text to lowercase.
   - Remove stop words (e.g., 'the', 'and', 'is').
   - Handle missing data appropriately.

2. **Tokenization:**
   - Tokenize the movie overviews into words or phrases.

3. **Stemming/Lemmatization:**
   - Apply stemming or lemmatization to reduce words to their root forms.

4. **Handling Genre Data:**
   - Replace genre IDs with their corresponding genre names using the genre data collected.
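
The text-cleaning and tokenization steps above could be combined into one helper, roughly as follows. This is a sketch under stated assumptions: `clean_text` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny illustrative sample rather than a full list, and tokens are split on whitespace only. Lemmatization and genre handling are not covered here.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller list (e.g. from nltk).
STOP_WORDS = {"the", "and", "is", "a", "of", "in", "for", "to"}


def clean_text(text):
    """Lowercase, drop non-letters, and remove stop words; returns a list of tokens."""
    if not isinstance(text, str):  # treat missing values (None, NaN) as empty text
        return []
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # punctuation, digits, symbols -> space
    return [word for word in text.split() if word not in STOP_WORDS]
```

Replacing unwanted characters with a space, rather than deleting them, keeps hyphenated words as separate tokens, at the cost of splitting possessives such as "Schindler's" into two tokens. The pattern also discards accented letters, so it is only suitable as written for plain English text.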

### Hints and Tips
- **API Requests:**
  - Use a loop to iterate through the pages of the movie API.
  - Use the `requests` library to make HTTP requests to the TMDB API.
  - Handle API rate limits by adding appropriate delays between requests if necessary.

- **Text Cleaning:**
  - Use the `re` library for regular expressions to clean text.
  - Utilize `nltk` or `spacy` libraries for advanced text processing, such as stop word removal and lemmatization.

- **Data Storage:**
  - Use the `pandas` library to create DataFrames and easily manipulate the data.
  - Save your cleaned data to a CSV file for further analysis.

### Submission
- Submit the code used to collect and clean the movie data.
- Include the cleaned movie data in a structured format (e.g., CSV file or Pandas DataFrame).
- Provide a brief explanation of the data cleaning steps taken.
- You can also include any additional analysis or insights derived from the cleaned data.

### Example
Here is an example of the cleaned movie data in a tabular format:

| Title          | Overview                                      | Genre          |
|----------------|-----------------------------------------------|----------------|
| Movie Title 1  | Cleaned overview text 1                       | Genre Name 1   |
| Movie Title 2  | Cleaned overview text 2                       | Genre Name 2   |
| ...            | ...                                           | ...            |


# **1) Data Collection**

In [494]:
# import library
import pandas as pd

In [495]:
# load page 1 of the top-rated movies (20 results per page)
movies_df = pd.read_json("https://api.themoviedb.org/3/movie/top_rated?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US&page=1")
movies_df.head()

Unnamed: 0,page,results,total_pages,total_results
0,1,"{'adult': False, 'backdrop_path': '/avedvodAZU...",478,9558
1,1,"{'adult': False, 'backdrop_path': '/tmU7GeKVyb...",478,9558
2,1,"{'adult': False, 'backdrop_path': '/kGzFbGhp99...",478,9558
3,1,"{'adult': False, 'backdrop_path': '/zb6fM1CX41...",478,9558
4,1,"{'adult': False, 'backdrop_path': '/qqHQsStV6e...",478,9558


In [496]:
# number of rows and columns in movies dataset
r, c = movies_df.shape
print("Rows = ", r)
print("Columns = ", c)

Rows =  20
Columns =  4


In [497]:
# keep only the results column
movies_df = movies_df["results"]
movies_df.head()

0    {'adult': False, 'backdrop_path': '/avedvodAZU...
1    {'adult': False, 'backdrop_path': '/tmU7GeKVyb...
2    {'adult': False, 'backdrop_path': '/kGzFbGhp99...
3    {'adult': False, 'backdrop_path': '/zb6fM1CX41...
4    {'adult': False, 'backdrop_path': '/qqHQsStV6e...
Name: results, dtype: object

In [498]:
# first entry of the results column
movies_df[0]

{'adult': False,
 'backdrop_path': '/avedvodAZUcwqevBfm8p4G2NziQ.jpg',
 'genre_ids': [18, 80],
 'id': 278,
 'original_language': 'en',
 'original_title': 'The Shawshank Redemption',
 'overview': 'Imprisoned in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope.',
 'popularity': 150.813,
 'poster_path': '/9cqNxx0GxF0bflZmeSMuL5tnGzr.jpg',
 'release_date': '1994-09-23',
 'title': 'The Shawshank Redemption',
 'video': False,
 'vote_average': 8.705,
 'vote_count': 26607}

In [499]:
# extracting title, overview and genre_ids from results column

title_list = []
overview_list = []
genre_ids_list = []

for i in range(movies_df.shape[0]):
    title_list.append(movies_df[i]["title"])
    overview_list.append(movies_df[i]["overview"])
    genre_ids_list.append(movies_df[i]["genre_ids"])

movies_df = pd.DataFrame(
    {
        "title": title_list,
        "overview": overview_list,
        "genre_ids": genre_ids_list
    }
)

movies_df.head()

Unnamed: 0,title,overview,genre_ids
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"[18, 80]"
1,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","[18, 80]"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"[18, 80]"
3,Schindler's List,The true story of how businessman Oskar Schind...,"[18, 36, 10752]"
4,12 Angry Men,The defense and the prosecution have rested an...,[18]


In [500]:
# load genres data from api
genres_df = pd.read_json("https://api.themoviedb.org/3/genre/movie/list?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US")
genres_df.head()

Unnamed: 0,genres
0,"{'id': 28, 'name': 'Action'}"
1,"{'id': 12, 'name': 'Adventure'}"
2,"{'id': 16, 'name': 'Animation'}"
3,"{'id': 35, 'name': 'Comedy'}"
4,"{'id': 80, 'name': 'Crime'}"


In [501]:
# extract genre_id and genre_name from genres data

genre_id_list = []
genres_name_list = []

for i in range(len(genres_df)):
    genre_id_list.append(genres_df["genres"][i]["id"])
    genres_name_list.append(genres_df["genres"][i]["name"])

genres_df = pd.DataFrame({
    "genre_id": genre_id_list,
    "genre_name": genres_name_list
})

genres_df.head(18)

Unnamed: 0,genre_id,genre_name
0,28,Action
1,12,Adventure
2,16,Animation
3,35,Comedy
4,80,Crime
5,99,Documentary
6,18,Drama
7,10751,Family
8,14,Fantasy
9,36,History


In [502]:
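# function to find the row index of a given genre id in genres_df

def findName(genre_id):
    # despite its name, this returns the position of the id, not the genre name
    return list(genres_df["genre_id"]).index(genre_id)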

In [503]:
# look up every id in the genre_ids column and replace it, in place,
# with the corresponding genre name

for i in range(len(movies_df["genre_ids"])):
    for j in range(len(movies_df["genre_ids"][i])):
        genre_id = movies_df["genre_ids"][i][j]
        index = findName(genre_id)
        movies_df["genre_ids"][i][j] = genres_df["genre_name"][index]

movies_df.head()

Unnamed: 0,title,overview,genre_ids
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"[Drama, Crime]"
1,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","[Drama, Crime]"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"[Drama, Crime]"
3,Schindler's List,The true story of how businessman Oskar Schind...,"[Drama, History, War]"
4,12 Angry Men,The defense and the prosecution have rested an...,[Drama]
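
The same replacement can be expressed with a dictionary lookup, which avoids scanning the genre list once per id. The sketch below is one possible alternative, not the notebook's method: `genre_names` and `ids_to_names` are hypothetical names, and the three genres are a sample of the id and name pairs shown in the outputs above.

```python
# Sample of the id -> name pairs returned by the genre endpoint above.
genres = [
    {"id": 18, "name": "Drama"},
    {"id": 80, "name": "Crime"},
    {"id": 36, "name": "History"},
]

# Build the lookup once; each id is then resolved in constant time.
genre_names = {genre["id"]: genre["name"] for genre in genres}


def ids_to_names(genre_ids, lookup):
    """Translate a list of genre ids into names, keeping unknown ids visible."""
    return [lookup.get(genre_id, f"unknown({genre_id})") for genre_id in genre_ids]
```

Applied to the DataFrame, this could look like `movies_df["genre_ids"].apply(lambda ids: ids_to_names(ids, genre_names))`. Unlike `list.index`, which raises `ValueError` for an id that is absent from the genre list, the lookup marks such ids instead of stopping.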


In [504]:
# rename genre_ids column to genres, because it now contains genre names, not genre ids
movies_df = movies_df.rename(columns={"genre_ids": "genres"})
movies_df.head()

Unnamed: 0,title,overview,genres
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"[Drama, Crime]"
1,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","[Drama, Crime]"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"[Drama, Crime]"
3,Schindler's List,The true story of how businessman Oskar Schind...,"[Drama, History, War]"
4,12 Angry Men,The defense and the prosecution have rested an...,[Drama]


In [505]:
# save collected data into csv file
movies_df.to_csv("movies.csv", index=False)

In [506]:
# load data from csv file
movies_df = pd.read_csv("movies.csv")
movies_df.shape

(20, 3)

In [507]:
# first five rows of dataset
movies_df.head()

Unnamed: 0,title,overview,genres
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"['Drama', 'Crime']"
1,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...","['Drama', 'Crime']"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"['Drama', 'Crime']"
3,Schindler's List,The true story of how businessman Oskar Schind...,"['Drama', 'History', 'War']"
4,12 Angry Men,The defense and the prosecution have rested an...,['Drama']
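
Note that after the CSV round trip the `genres` column holds text such as `"['Drama', 'Crime']"` rather than Python lists. If real lists are needed later, one option is `ast.literal_eval`, sketched here on a small in-memory example (the two-row frame and the variable names are illustrative only):

```python
import ast
import io

import pandas as pd

original = pd.DataFrame(
    {"title": ["The Godfather", "12 Angry Men"], "genres": [["Drama", "Crime"], ["Drama"]]}
)

# Round-trip through CSV in memory instead of writing a file.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, each cell is the text "['Drama', 'Crime']", not a list.
as_text = reloaded["genres"].iloc[0]

# ast.literal_eval safely parses that text back into a Python list.
reloaded["genres"] = reloaded["genres"].apply(ast.literal_eval)
```

The same conversion can be done while loading by passing `converters={"genres": ast.literal_eval}` to `pd.read_csv`. The lowercase step later in this notebook calls `.str.lower()` on the column and therefore relies on the text form, so the two approaches should not be mixed without adjusting that cell.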


# **2) Data Cleaning**

### **Removing punctuation and special characters**

In [508]:
import string

exclude = string.punctuation

In [509]:
def remove_punctuation(text):
    for i in exclude:
        text = text.replace(i, '')
    return text

In [510]:
movies_df['title'] = movies_df['title'].apply(lambda x: remove_punctuation(x))
movies_df['overview'] = movies_df['overview'].apply(lambda x: remove_punctuation(x))
movies_df.head()

Unnamed: 0,title,overview,genres
0,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,"['Drama', 'Crime']"
1,The Godfather,Spanning the years 1945 to 1955 a chronicle of...,"['Drama', 'Crime']"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"['Drama', 'Crime']"
3,Schindlers List,The true story of how businessman Oskar Schind...,"['Drama', 'History', 'War']"
4,12 Angry Men,The defense and the prosecution have rested an...,['Drama']


### **Removing numbers from data** 

In [511]:
numbers = string.digits

In [512]:
def remove_numbers(text):
    for i in numbers:
        text = text.replace(i, '')
    return text

In [513]:
movies_df['title'] = movies_df['title'].apply(lambda x: remove_numbers(x))
movies_df['overview'] = movies_df['overview'].apply(lambda x: remove_numbers(x))
movies_df.head()

Unnamed: 0,title,overview,genres
0,The Shawshank Redemption,Imprisoned in the s for the double murder of h...,"['Drama', 'Crime']"
1,The Godfather,Spanning the years to a chronicle of the fic...,"['Drama', 'Crime']"
2,The Godfather Part II,In the continuing saga of the Corleone crime f...,"['Drama', 'Crime']"
3,Schindlers List,The true story of how businessman Oskar Schind...,"['Drama', 'History', 'War']"
4,Angry Men,The defense and the prosecution have rested an...,['Drama']


### **Convert data into lowercase**

In [514]:
movies_df["title"] = movies_df["title"].str.lower()
movies_df["overview"] = movies_df["overview"].str.lower()
movies_df["genres"] = movies_df["genres"].str.lower()

movies_df.head()

Unnamed: 0,title,overview,genres
0,the shawshank redemption,imprisoned in the s for the double murder of h...,"['drama', 'crime']"
1,the godfather,spanning the years to a chronicle of the fic...,"['drama', 'crime']"
2,the godfather part ii,in the continuing saga of the corleone crime f...,"['drama', 'crime']"
3,schindlers list,the true story of how businessman oskar schind...,"['drama', 'history', 'war']"
4,angry men,the defense and the prosecution have rested an...,['drama']


### **Removing Stopwords**

In [515]:
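import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time corpus download
stop = set(stopwords.words('english'))  # set gives fast membership tests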

In [516]:
def remove_stopwords(text):
    new_text = []
    for word in text.split():
        if word not in stop:
            new_text.append(word)
    return " ".join(new_text)

In [517]:
movies_df['title'] = movies_df['title'].apply(lambda x: remove_stopwords(x))
movies_df['overview'] = movies_df['overview'].apply(lambda x: remove_stopwords(x))

movies_df.head()

Unnamed: 0,title,overview,genres
0,shawshank redemption,imprisoned double murder wife lover upstanding...,"['drama', 'crime']"
1,godfather,spanning years chronicle fictional italianamer...,"['drama', 'crime']"
2,godfather part ii,continuing saga corleone crime family young vi...,"['drama', 'crime']"
3,schindlers list,true story businessman oskar schindler saved t...,"['drama', 'history', 'war']"
4,angry men,defense prosecution rested jury filing jury ro...,['drama']


### **Handling Missing Values**

In [518]:
movies_df.isnull().sum()

title       0
overview    0
genres      0
dtype: int64

- **There are no missing values in this dataset.**
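
`isnull()` only counts real missing values (`NaN` or `None`). Cleaning can also leave empty strings behind, for example when a text consists entirely of digits or stop words. A minimal sketch of one way to catch both cases follows; the rows are hypothetical and dropping the affected rows is only one possible policy.

```python
import pandas as pd

# Hypothetical rows: one overview is empty, one is whitespace, one is None.
df = pd.DataFrame(
    {
        "title": ["godfather", "angry men", "untitled", "unknown"],
        "overview": ["spanning years chronicle", "", "   ", None],
    }
)

# Treat NaN/None, empty strings, and whitespace-only strings as missing.
is_missing = df["overview"].fillna("").str.strip().eq("")

# One possible policy: drop rows that have no usable overview.
df_clean = df[~is_missing].reset_index(drop=True)
```

Here `is_missing.sum()` reports three affected rows, whereas `df["overview"].isnull().sum()` on the same frame would report only one.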

### **Tokenization**

In [519]:
from nltk.tokenize import word_tokenize, sent_tokenize

def tokenization(text):
    # split text into sentences, then each sentence into word tokens
    return [word_tokenize(sentence) for sentence in sent_tokenize(text)]

In [520]:
movies_df['title'] = movies_df['title'].apply(lambda x: tokenization(x))
movies_df['overview'] = movies_df['overview'].apply(lambda x: tokenization(x))
movies_df.head()

Unnamed: 0,title,overview,genres
0,"[[shawshank, redemption]]","[[imprisoned, double, murder, wife, lover, ups...","['drama', 'crime']"
1,[[godfather]],"[[spanning, years, chronicle, fictional, itali...","['drama', 'crime']"
2,"[[godfather, part, ii]]","[[continuing, saga, corleone, crime, family, y...","['drama', 'crime']"
3,"[[schindlers, list]]","[[true, story, businessman, oskar, schindler, ...","['drama', 'history', 'war']"
4,"[[angry, men]]","[[defense, prosecution, rested, jury, filing, ...",['drama']
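
The double brackets in this output come from tokenizing by sentence first: each cell is a list of sentences, and each sentence is a list of words. Because punctuation was removed earlier, there is typically only one inner list per cell. If a flat list of tokens is preferred, the nesting can be removed as sketched below; the two-sentence input is a hypothetical example.

```python
from itertools import chain

# Shape produced by the tokenization cell: one inner list per sentence.
nested = [["defense", "prosecution", "rested"], ["jury", "filing"]]

# Concatenate the inner lists into a single list of tokens.
flat = list(chain.from_iterable(nested))
```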


### **Lemmatization**

In [521]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


def lemmatize_text(text):
    # text is a list of tokenized sentences; the result is one flat list of lemmas
    # pos='v' treats every token as a verb, so nouns such as "years" stay unchanged
    return [lemmatizer.lemmatize(word, pos='v') for words in text for word in words]

In [522]:
movies_df['title'] = movies_df['title'].apply(lambda x: lemmatize_text(x))
movies_df['overview'] = movies_df['overview'].apply(lambda x: lemmatize_text(x))
movies_df.head()

Unnamed: 0,title,overview,genres
0,"[shawshank, redemption]","[imprison, double, murder, wife, lover, upstan...","['drama', 'crime']"
1,[godfather],"[span, years, chronicle, fictional, italianame...","['drama', 'crime']"
2,"[godfather, part, ii]","[continue, saga, corleone, crime, family, youn...","['drama', 'crime']"
3,"[schindlers, list]","[true, story, businessman, oskar, schindler, s...","['drama', 'history', 'war']"
4,"[angry, men]","[defense, prosecution, rest, jury, file, jury,...",['drama']


- **Save preprocessed data into csv file**

In [523]:
movies_df.to_csv("preprocessed_movies_data.csv", index=False)