# Movie recommendation 🍿

The goal of this notebook is to build a content-based recommendation engine. Along the way, we will also explore and play around with the spaCy 3 library.

# Let's understand spaCy 3

spaCy is a **free, open-source library** for advanced Natural Language Processing (NLP) in Python.

spaCy can be used to **build information extraction** or **natural language understanding systems**, or to **pre-process text** for deep learning.

Source materials: *[INTRODUCTION TO SPACY 3](https://spacy.pythonhumanities.com/intro.html), [spaCy.io Official](https://spacy.io/api) and [freeCodeCamp.org](https://www.youtube.com/watch?v=dIUTsFT2MeQ&t=5473s&ab_channel=freeCodeCamp.org)*

## spaCy vs. NLTK

| Feature | spaCy | NLTK |
|---|---|---|
| Approach | Pre-trained models, object-oriented | Toolkit of components |
| Suitability | Developers, production | Researchers, customization |
| Performance | Faster | Slower |
| Languages | Trained pipelines for a limited set (English, German, etc.) | Wider range |
| Word Vectors | Supported | Not directly supported |

## spaCy architecture

![](https://spacy.io/images/architecture.svg)

### Architecture in a nutshell

* Initialize the `nlp` object (the language pipeline).
* Pass the text data to the `nlp` object to create a `Doc` object.
* While the `Doc` object is being created, the spaCy pipeline components are run and their results are stored in the `Doc` object.
* From the `Doc` object, we can access containers such as the vocab, tokens, sentences, etc.
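
The steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual pipeline: it assumes a blank English pipeline with only a rule-based `sentencizer` (so it runs without downloading a trained model), and the sample text is made up for the example.

```python
import spacy

# Assumption: a blank English pipeline instead of "en_core_web_sm",
# so the sketch works without a downloaded model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundary detection

# Hypothetical sample text, not taken from the dataset
text = "spaCy builds a Doc object. Each Doc holds tokens and sentences."

# Calling nlp on the text tokenizes it and runs every pipeline component
doc = nlp(text)

tokens = [token.text for token in doc]         # Token containers
sentences = [sent.text for sent in doc.sents]  # Span containers, one per sentence

print(nlp.pipe_names)
print(tokens)
print(sentences)
```

With a trained pipeline such as `en_core_web_sm`, the same `doc` would also carry part-of-speech tags, dependency parses and named entities.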

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk 

In [22]:
# Loading the small English pipeline of spaCy 3
import spacy
nlp = spacy.load("en_core_web_sm")
# nlp.analyze_pipes()

In [47]:
def basic_preprocessing(df):
    """Return a cleaned copy with numeric Gross and runtime columns."""
    df = df.copy()

    # Convert the Gross column to numeric: strip thousands separators
    # from strings and leave missing values untouched
    df["Gross"] = df["Gross"].apply(
        lambda x: int(x.replace(",", "")) if isinstance(x, str) else x
    )

    # Convert Runtime (e.g. "142 min") to an integer number of minutes
    df["Runtime_in_mins"] = df["Runtime"].apply(lambda x: int(x.replace("min", "")))

    # Drop the poster link and the original Runtime column
    df = df.drop(["Poster_Link", "Runtime"], axis=1)

    return df

def data_quality_checks(df):
    """Summarise the missing values of each column."""
    missing_count = df.isna().sum()
    data_quality_df = pd.DataFrame({
        "missing_count": missing_count,
        "missing_perc": round(missing_count / len(df) * 100, 2),
        # TODO: add a "duplicated" check
    })
    data_quality_df["missing_perc"] = data_quality_df["missing_perc"].astype(str) + " %"
    return data_quality_df
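
As a side note, the two conversions in `basic_preprocessing` can also be written with pandas' vectorised string methods. The sketch below is only an illustration on a toy frame: the values are made up, and it assumes `Gross` arrives as comma-separated strings with some missing entries.

```python
import pandas as pd

# Toy data (hypothetical values), shaped like the columns being cleaned
toy_df = pd.DataFrame({
    "Runtime": ["142 min", "96 min", "175 min"],
    "Gross": ["28,341,469", None, "4,360,000"],
})

# Remove thousands separators, then parse; missing values stay NaN,
# so the resulting column is float64
toy_df["Gross"] = pd.to_numeric(toy_df["Gross"].str.replace(",", "", regex=False))

# Strip the "min" suffix and surrounding spaces, then cast to integers
toy_df["Runtime_in_mins"] = (
    toy_df["Runtime"].str.replace("min", "", regex=False).str.strip().astype(int)
)

print(toy_df.dtypes)
```

The vectorised form avoids the per-row `lambda` and makes the handling of missing values explicit.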

In [46]:
imdb_df = pd.read_csv("/kaggle/input/imdb-dataset-of-top-1000-movies-and-tv-shows/imdb_top_1000.csv")
imdb_df.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


In [48]:
imdb_df = basic_preprocessing(imdb_df)

In [49]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Series_Title     1000 non-null   object 
 1   Released_Year    1000 non-null   object 
 2   Certificate      899 non-null    object 
 3   Genre            1000 non-null   object 
 4   IMDB_Rating      1000 non-null   float64
 5   Overview         1000 non-null   object 
 6   Meta_score       843 non-null    float64
 7   Director         1000 non-null   object 
 8   Star1            1000 non-null   object 
 9   Star2            1000 non-null   object 
 10  Star3            1000 non-null   object 
 11  Star4            1000 non-null   object 
 12  No_of_Votes      1000 non-null   int64  
 13  Gross            831 non-null    float64
 14  Runtime_in_mins  1000 non-null   int64  
dtypes: float64(3), int64(2), object(10)
memory usage: 117.3+ KB
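
The non-null counts above show that only `Certificate`, `Meta_score` and `Gross` have gaps. As a quick worked check, the missing counts and percentages can be derived directly from those numbers. The sketch below simply copies the counts from the `info()` output into a Series (the variable names are illustrative).

```python
import pandas as pd

total_rows = 1000  # "RangeIndex: 1000 entries" in the info() output

# Non-null counts copied from the info() output for the incomplete columns
non_null = pd.Series({"Certificate": 899, "Meta_score": 843, "Gross": 831})

missing_count = total_rows - non_null
missing_perc = (missing_count / total_rows * 100).round(2)

summary = pd.DataFrame({"missing_count": missing_count, "missing_perc": missing_perc})
print(summary)
```

This gives 101, 157 and 169 missing values, i.e. roughly 10.1 %, 15.7 % and 16.9 % of the rows.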

