## Exploratory Data Analysis of Movies Database and Predicting Movies Successful or Not

## <center> <font color = red> RECOMMENDER SYSTEM </font> </center>

<img src = 'https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true' width="240" height="360">

## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Data Loading and Description](#section2)
3. [Data Profiling](#section3)
    - 3.1 [Understanding the Dataset](#section301)<br/>
    - 3.2 [Pre Profiling](#section302)<br/>
    - 3.3 [Preprocessing](#section303)<br/>
    - 3.4 [Post Profiling](#section304)<br/>
4. [Questions](#section4)
    - 4.1 [How many features play a key role in deciding the final score?](#section401)<br/>
    - 4.2 [What type of correlation exists between the features used to predict the match score?](#section402)<br/>
5. [Preparing X (independent features) for the model building.](#section5)
    - 5.1 [Check for the type and shape of X.](#section501)<br/>
6. [Extract y (dependent variable) for model building.](#section5)
    - 6.1 [Check for the type and shape of y.](#section601)<br/>
7. [Split the value of X and y into train and test datasets](#section7)
8. [Features Scaling](#section8)
9. [Check the shape of X and y of train dataset.](#section9)
10. [Check the shape of X and y of test dataset.](#section10)
11. [Classification.](#section11)
    - 11.1 [Test Classifiers.](#section1101)<br/>
    - 11.2 [Principal Component Analysis.](#section1102)<br/>
    - 11.3 [Using One hot encoding.](#section1103)<br/>
12. [Evaluate the model](#section12)  
13. [Conclusions](#section13)<br/> 

<a id=section1></a>

### 1. Problem Statement

Presenting users with the products most relevant to their needs and preferences is an important task for any company. To do this properly, we need to be able to extract those preferences from the available raw data.

Deducing interpretations from available raw data can be tricky, because to succeed we need to:

- __Understand what the users’ needs are__: We will typically only have very limited, implicit data of what a user might be interested in. For instance, Netflix needs to infer their users’ preferences of movies based on the movies they have watched previously. The users won’t explicitly tell Netflix what they like.

- __Prioritise all matches__: Even if a company like Netflix is able to satisfactorily model user preferences in movies, they still have a big problem: There are >50,000 movies out there of which thousands may fit with the user’s preferences. Which movies should Netflix recommend first?


A Recommender System employs a statistical algorithm that seeks to predict users' preferences for a particular entity, based on the similarity between the entities or similarity between the users that previously rated those entities. The intuition is that similar types of users are likely to have similar ratings for a set of entities.

We will analyse the two datasets to gather information about the users' preferences and the movies, so that we can understand what the users need and prioritise all matches.

__In the end, we will present users with the products most relevant to their needs and preferences, using machine learning algorithms based on past data, visualizations, perspectives, etc.__



<a id=section2></a>

### 2. Data Loading and Description

We are going to analyze two datasets, tmdb_5000_credits and tmdb_5000_movies.

First one contains 4803 observations with the following columns:
- movie_id
- title
- cast
- crew

Second one contains 4803 observations with the following columns:
- budget, genres, homepage, id, keywords, original_language
- original_title, overview, popularity, production_companies
- production_countries, release_date, revenue, runtime
- spoken_languages, status, tagline, title, vote_average, vote_count

We will merge the two datasets to get all the information about the actors and directors of each movie.

The main problem with this dataset is the JSON format: many columns store their values as JSON, so cleaning the dataset was the main challenge. For readers unfamiliar with JSON (JavaScript Object Notation), it is a text syntax for storing and exchanging data between systems. Here, each such cell holds key:value records embedded in a string.
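
To make the JSON point concrete, here is a minimal sketch of how one such cell can be decoded with Python's standard `json` module. The sample string is made up for illustration (its keys only mimic what a `genres` cell might look like); it is not a row taken from the dataset.

```python
import json

# Hypothetical cell value: a JSON array of objects stored as a single string.
raw_cell = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads turns the string into a list of Python dictionaries.
records = json.loads(raw_cell)

# Keep only the value stored under the "name" key of each object.
names = [record["name"] for record in records]
print(names)
```

The `parse_col_json` helper defined later applies the same idea to a whole column, one row at a time.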

The datasets comprise __4803 observations with 4 and 20 columns respectively__. The tables below list the names of all the columns and their descriptions.

| Column Name             | Description                                             |
| ------------------      |:-------------------------------------------------------:| 
| movie_id                | Identity of the movie                                   | 
| title                   | Title of the movie                                      |  
| cast                    | Contains information of the cast of each movie          | 
| crew                    | Contains information of the crew of each movie          |   

| Column Name             | Description                                              |
| -------------------     |:--------------------------------------------------------:| 
| id                      | Identity of the movie                                    | 
| budget                  | Movie budget                                             |  
| genres                  | Category of Movies                                       | 
| homepage                | Homepage of the movies                                   |   
| keywords                | Information about movie                                  |
| original_language       | Original language of movie                               |
| original_title          | Original title of movie                                  |
| overview                | Overview about movie                                     | 
| popularity              | Popularity of the movie                                  |
| production_companies    | Production companies                                     |
| production_countries    | Production countries                                     |
| release_date            | Release date of the movie                                |
| revenue                 | Revenue                                                  |
| runtime                 | Time duration in minute of the movie                     |
| spoken_languages        | Spoken languages in the movie                            |
| status                  | Status of the movie                                      |
| tagline                 | Tagline of the movie                                     |
| title                   | Movie Title                                              |
| vote_average            | Average rating                                           |
| vote_count              | Vote count                                               |

#### Source :
https://www.themoviedb.org/documentation/api


#### Importing packages                                          

In [11]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics

%matplotlib inline
sns.set()

from subprocess import check_output

__Utils Functions__

In [12]:
from collections import defaultdict, Counter
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import mean_absolute_error
# from xgboost import XGBRegressor
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
import re
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

parameters = {'n_estimators': [4, 6, 9],
              'max_features': ['log2', 'sqrt'],  # 'auto' was removed in newer scikit-learn; for classifiers it meant 'sqrt'
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10],
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
              }

def Checker(x):
    if type(x) is pd.DataFrame:
        return 0
    elif type(x) is pd.Series:
        return 1
    else:
        return -1


def dtype(data):
    what = Checker(data)
    if what == 0:
        dtypes = data.dtypes.astype('str')
        dtypes = dtypes.str.split(r'\d').str[0]
    else:
        dtypes = str(data.dtypes)
        dtypes = re.split(r'\d', dtypes)[0]
    return dtypes


def split(x, pattern):
    r'''Split each string in x on a regex pattern. Extra shorthand tokens are supported:
        \d = digits
        \l = lowercase alphabet
        \p = uppercase alphabet
        \a = all alphabet
        \s = symbols and punctuation
        \e = end of sentence
        '''
    # Raw strings keep the backslashes literal ('\a' would otherwise be the bell character)
    pattern2 = (pattern.replace(r'\d', '[0-9]').replace(r'\l', '[a-z]').replace(r'\p', '[A-Z]')
                .replace(r'\a', '[a-zA-Z]').replace(r'\s', '[^0-9a-zA-Z]').replace(r'\e', r'(?:\s|$)'))

    if dtype(x) != 'object':
        print('Data is not string. Convert first')
        return False

    # Compile the translated pattern, not the original shorthand
    regex = re.compile(pattern2)
    if pattern == pattern2:
        return x.str.split(pattern)
    else:
        return x.apply(lambda i: re.split(regex, i))


def replace(x, pattern, with_=None):
    r'''Replace regex matches in each string of x. Extra shorthand tokens are supported:
        \d = digits
        \l = lowercase alphabet
        \p = uppercase alphabet
        \a = all alphabet
        \s = symbols and punctuation
        \e = end of sentence
        '''
    # A list of [old, new] pairs is applied as a plain value-to-value mapping
    if type(pattern) is list:
        d = {}
        for l in pattern:
            d[l[0]] = l[1]
        try:
            return x.replace(d)
        except Exception:
            return x.astype('str').replace(d)

    # Raw strings keep the backslashes literal ('\a' would otherwise be the bell character)
    pattern2 = (pattern.replace(r'\d', '[0-9]').replace(r'\l', '[a-z]').replace(r'\p', '[A-Z]')
                .replace(r'\a', '[a-zA-Z]').replace(r'\s', '[^0-9a-zA-Z]').replace(r'\e', r'(?:\s|$)'))

    if dtype(x) != 'object':
        print('Data is not string. Convert first')
        return False

    # Compile the translated pattern, not the original shorthand
    regex = re.compile(pattern2)
    if pattern == pattern2:
        return x.str.replace(pattern, with_, regex=True)
    else:
        return x.apply(lambda i: re.sub(regex, with_, i))


def hcat(*columns):
    cols = []
    for c in columns:
        if c is None:
            continue
        if type(c) in (list, tuple):
            for i in c:
                if type(i) not in (pd.DataFrame, pd.Series):
                    cols.append(pd.Series(i))
                else:
                    cols.append(i)
        elif type(c) not in (pd.DataFrame, pd.Series):
            cols.append(pd.Series(c))
        else:
            cols.append(c)
    return pd.concat(cols, axis=1)


def parse_col_json(df, column, key, nested):
    """
    Args:
        column: string
            name of the column to be processed.
        key: string
            name of the dictionary key which needs to be extracted
    """
    import json
    for index, i in zip(df.index, df[column].apply(json.loads)):
        list1 = []
        males = []
        females = []

        for j in range(len(i)):
            if nested:
                if not(((i[j]["department"] == "Directing") and (i[j]["job"] == "Director"))):
                    continue
            name = i[j][key]
            if "," in name:
                name = name.replace(",", " ")
            if " " in name:
                name = name.replace(" ", "_")
            list1.append(name)
            if column=="cast":
                if i[j]["gender"] == 1:
                    females.append(name)
                elif i[j]["gender"] == 2:
                    males.append(name)
        df.loc[index, column] = str(list1)
        if column=="cast":
            df.loc[index, "actors"] = str(males)
            df.loc[index, "actress"] = str(females)

def counts_elements(df, columns):
    import ast
    d = defaultdict(Counter)
    for column in columns:
        for el in df[column]:
            # literal_eval parses the stringified list without executing arbitrary code
            l = ast.literal_eval(str(el))
            for x in l:
                d[column][x] += 1
    return d

def counts_vectorized(df, col, min=1, vocabulary=None):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(tokenizer=lambda x: x.split(
        ","), min_df=min, vocabulary=vocabulary)
    #analyze = vectorizer.build_analyzer()
    #f = analyze(succ_movies.cast.iloc[0].strip("[]"))
    data = [x.strip("[]") for x in df[col]]
    #analyze(["ciao, mamma, come, stai_oggi", "ehi, mamma, stai_oggi, cane"])
    vectorizer.fit(data)
    counts = pd.DataFrame(vectorizer.transform(data).toarray())
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    counts.columns = [x.replace("'", "")
                      for x in vectorizer.get_feature_names_out()]
    return counts


def simplify(df, col, bins, group_names):
    df[col] = df[col].fillna(-0.5)
    categories = pd.cut(df[col], bins, labels=group_names)
    df[col] = categories


def encode_features(df):
    features = ['year', 'runtime']

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df


def testClassifier(clf, name, dict):
    y_pred = []
    if name == "Gradient Boosting":
        y_pred = clf.fit(X_train, y_train.values.ravel(), early_stopping_rounds=5,
             eval_set=[(X_test, y_test)], verbose=False).predict(X_test)
        y_pred = [round(value) for value in y_pred]
    else:
        clf = clf.fit(X_train, y_train.values.ravel())
        y_pred = clf.predict(X_test)
        
    # Compute confusion matrix
    CM = confusion_matrix(y_test, y_pred)

    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]

    # Sensitivity, hit rate, recall, or true positive rate
    TPR = TP/(TP+FN)
    # Specificity or true negative rate
    TNR = TN/(TN+FP) 
    # Precision or positive predictive value
    PPV = TP/(TP+FP)
    # Negative predictive value
    NPV = TN/(TN+FN)
    # Fall out or false positive rate
    FPR = FP/(FP+TN)
    # False negative rate
    FNR = FN/(TP+FN)
    # False discovery rate
    FDR = FP/(TP+FP)
    # Overall accuracy
    ACC = (TP+TN)/(TP+FP+FN+TN)
    print("{} Scores:\n".format(name))
    print("Accuracy: {0:.2f} %\nPrecision: {1:.2f} %\nRecall: {2:.2f} %\nFall out: {3:.2f} %\nFalse Negative Rate: {4:.2f} %\n\n"
        .format(ACC.round(4)*100.0,PPV.round(4)*100.0,TPR.round(4)*100.0,FPR.round(4)*100.0,FNR.round(4)*100.0))

    # Fall out is reported as the false positive rate (FPR)
    dict["classifier"].append(name)
    dict["accuracy"].append(ACC.round(4)*100.0)
    dict["fallout"].append(FPR.round(4)*100.0)
    dict["fnr"].append(FNR.round(4)*100.0)
    return clf
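
As a quick check on the formulas used in `testClassifier`, the sketch below computes the same rates by hand from a small, made-up set of labels (these values are illustrative and do not come from the dataset). It uses plain Python so that the arithmetic is easy to follow. Note that fall-out is the false positive rate, FP / (FP + TN).

```python
# Made-up ground truth and predictions for a binary problem (1 = positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Count the four cells of the confusion matrix.
pairs = list(zip(y_true, y_pred))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
TN = sum(1 for t, p in pairs if t == 0 and p == 0)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (TP + TN) / len(pairs)   # share of correct predictions
precision = TP / (TP + FP)          # positive predictive value
recall = TP / (TP + FN)             # true positive rate
fall_out = FP / (FP + TN)           # false positive rate
fnr = FN / (TP + FN)                # false negative rate

print(TP, TN, FP, FN)
print(accuracy, precision, recall, fall_out, fnr)
```

With these labels there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, so accuracy is 8/10 and fall-out is 1/6.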

<a id=section3></a>

## 3. Data Profiling

- In the upcoming sections we will first __understand our dataset__ using various pandas functionalities.
- Then with the help of __pandas profiling__ we will find which columns of our dataset need preprocessing.
- In __preprocessing__ we will deal with erroneous and missing values of columns. 
- Again we will do __pandas profiling__ to see how preprocessing has transformed our dataset.

#### Importing the Dataset

In [13]:
# load the dataset
ratings = pd.read_csv('data/ratings.csv')
books = pd.read_csv('data/books.csv')
book_tags = pd.read_csv('data/book_tags.csv')
tags = pd.read_csv('data/tags.csv')
to_read = pd.read_csv('data/to_read.csv')

<a id=section301></a>

### 3.1 Understanding the Dataset

To gain insights from data we must look into each aspect of it very carefully. We will start by observing a few rows and columns of the data, both from the beginning and from the end.

Let us check the basic information of the dataset. The very basic information to know is the dimension of the dataset – rows and columns – that’s what we find out with the method __shape__.

#### Understanding the Dataset - Rating

In [None]:
ratings.shape

In [None]:
ratings.columns

In [None]:
ratings.head(2)

In [None]:
ratings.info()

In [None]:
ratings.describe(include='all')

In [None]:
ratings.isnull().sum()

In [None]:
ratings.count()

#### Understanding the Dataset - Books

In [None]:
books.shape

In [None]:
books.columns

In [None]:
books.head(2)

In [None]:
books.info()

In [None]:
books.describe(include='all')

In [None]:
books.isnull().sum()

In [None]:
books.count()

For reference, `ratings` has __5976479 rows and 3 columns__ (`user_id`, `book_id`, `rating`).

#### Understanding the Dataset - Books Tags

In [None]:
book_tags.shape

In [None]:
book_tags.columns

In [None]:
book_tags.head(2)

In [None]:
book_tags.info()

In [None]:
book_tags.describe(include='all')

In [None]:
book_tags.isnull().sum()

In [None]:
book_tags.count()

#### Understanding the Dataset - Tags

In [None]:
tags.shape

In [None]:
tags.columns

In [None]:
tags.head(2)

In [None]:
tags.info()

In [None]:
tags.describe(include='all')

In [None]:
tags.isnull().sum()

In [None]:
tags.count()

#### Understanding the Dataset - To Read Books Data

In [None]:
to_read.shape

In [None]:
to_read.columns

In [None]:
to_read.head(2)

In [None]:
to_read.info()

In [None]:
to_read.describe(include='all')

In [None]:
to_read.isnull().sum()

In [None]:
to_read.count()

<a id=section302></a>

### 3.2 Pre Profiling

- With pandas profiling, an __interactive HTML report__ is generated that contains all the information about the columns of the dataset, such as the __counts and type__ of each _column_, detailed information about each column, the __correlation between different columns__ and a sample of the dataset.<br/>
- It gives us a __visual interpretation__ of each column in the data.
- The _spread of the data_ can be better understood from the distribution plot. 
- It provides a _granular-level_ analysis of each column.

In [None]:
profile = pandas_profiling.ProfileReport(ratings)
profile.to_file(outputfile="BeforePreprocessing/ratings_data_before_preprocessing.html")

In [None]:
profile = pandas_profiling.ProfileReport(books)
profile.to_file(outputfile="BeforePreprocessing/books_data_before_preprocessing.html")

Here, we have done pandas profiling before preprocessing our datasets, so we have named the HTML files __ratings_data_before_preprocessing.html__ and __books_data_before_preprocessing.html__. Take a look at the files and see what useful insights you can develop from them. <br/>
Now we will process our data to better understand it.

<a id=section303></a>

### 3.3 Preprocessing

- Dealing with missing values<br/>
    - Remove duplicates.
    - Drop columns that are not useful.
    - Split the year from the release date.
    - Replace all the zeros in the revenue and budget columns.
    - Drop all the rows with NA in the columns mentioned above.
    - Filter records from both tables and merge them into a single table.
    - Parse JSON values from columns, which can be used to predict movie ratings.
    

**Remove duplicates.**

As observed from the dataframes above, some columns contain unnecessary information and some rows are duplicated, so let's update the data frames.

In [14]:
ratings.drop_duplicates(keep="first", inplace=True)

In [15]:
books.drop_duplicates(keep="first", inplace=True)

In [16]:
book_tags.drop_duplicates(keep='first', inplace=True)

In [17]:
tags.drop_duplicates(keep='first', inplace=True)

In [22]:
to_read.drop_duplicates(keep='first', inplace=True)

__Drop columns that are not useful__

The following columns contain many null values and we are not going to use them, so we will drop them from the books dataset:
- isbn
- isbn13
- original_publication_year
- original_title
- language_code

In [23]:
useless_col = ['isbn', 'isbn13', 'original_publication_year', 'original_title',
               'language_code']
books.drop(useless_col, axis=1, inplace=True)
books.head(1)

|   | book_id | goodreads_book_id | best_book_id | work_id | books_count | authors | title | average_rating | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2767052 | 2767052 | 2792775 | 272 | Suzanne Collins | The Hunger Games (The Hunger Games, #1) | 4.34 | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |


__Removal of Null values__

We now check whether any `NaN` (null) values remain in our dataframes. A naive fix would be to replace such values with zeros, but the counts below show that none are left:

In [19]:
ratings.isnull().sum()

user_id    0
book_id    0
rating     0
dtype: int64

In [25]:
books.isnull().sum()

book_id                    0
goodreads_book_id          0
best_book_id               0
work_id                    0
books_count                0
authors                    0
title                      0
average_rating             0
ratings_count              0
work_ratings_count         0
work_text_reviews_count    0
ratings_1                  0
ratings_2                  0
ratings_3                  0
ratings_4                  0
ratings_5                  0
image_url                  0
small_image_url            0
dtype: int64

In [26]:
tags.isnull().sum()

tag_id      0
tag_name    0
dtype: int64

<a id=section304></a>

### 3.4 Post Pandas Profiling

In [27]:
profile = pandas_profiling.ProfileReport(ratings)
profile.to_file(outputfile="AfterPreprocessing/ratings_data_after_preprocessing.html")

In [28]:
profile = pandas_profiling.ProfileReport(books)
profile.to_file(outputfile="AfterPreprocessing/books_data_after_preprocessing.html")

Now that we have preprocessed the data, the dataset does not contain missing values. You can compare the two reports for each table, e.g. __books_data_after_preprocessing.html__ and __books_data_before_preprocessing.html__.<br/>
In the __books_data_after_preprocessing.html__ report, observations:
- In the Dataset info, Total __Missing(%)__ = __0.0%__ 
- Number of __variables__ = __18__ 

__Utils functions__

In [29]:
class ChartType:
    bar_chart = 1
    bar_chart_horizontal = 2
    line_chart = 3
    histogram_chart = 4
    stack_chart = 5
    scatter_chart = 6
    area_chart = 7
    pie_chart = 8


In [30]:
def showChart(data, chart_type, xlabel, ylabel, title=None, figsize=None, axis=None):
    '''
    data : data frame,
    xlabel : The label text for x axis.
    ylabel : The label text for y axis.
    title : The label text for title of chart.
    figsize : tuple of integers, optional, default: None
    axis : The axis limits to be set. Either none or all of the limits must
    be given.
    '''
    # Set figure size of chart
    if figsize != None:
        plt.figure(figsize=figsize)

    # Set x & y axis limit
    if axis != None:
        plt.axis(axis) 

    # Draw bar chart
    if ChartType.bar_chart == chart_type:
        data.plot.bar()
    elif ChartType.bar_chart_horizontal == chart_type:
        data.plot.barh()
    elif ChartType.stack_chart == chart_type:
        data.plot.bar(stacked=True)
    elif ChartType.line_chart == chart_type:
        data.plot.line()
    elif ChartType.histogram_chart == chart_type:
        data.plot.hist()
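    elif ChartType.scatter_chart == chart_type:
        # Assumption: data is a DataFrame whose first two columns hold x and y
        data.plot.scatter(x=data.columns[0], y=data.columns[1])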
    elif ChartType.area_chart == chart_type:
        data.plot.area()
    elif ChartType.pie_chart == chart_type:
        plt.pie(data.values,
                       labels=data.index,
                       autopct='%1.2f', startangle=90)
        
#         explode = (0.2, 0, 0, 0, 0, 0)
#         plt.explode = explode
#         plt.autopct='%1.1f%%'
        plt.legend(data.index, loc="best")
        plt.axis('equal')
#         plt.pctdistance=1.1
#         plt.labeldistance=1.2
#         data.plot.pie()
        
    # Set title of chart, y & x axis
    if title != None:
        plt.title(title, fontsize=20)
        
    if xlabel != None:
        plt.xlabel(xlabel, fontsize=10)

    if ylabel != None:
        plt.ylabel(ylabel, fontsize=10)

    # Custom ticks for x axis
    plt.tick_params(axis='x', colors='black', direction='out', length=5, width=1, labelsize='large')
    
    # Custom ticks for y axis
    plt.tick_params(axis='y', colors='black', direction='in', length=5, width=1, labelsize='large')
    
    # Show chart
    plt.show()
    

<a id=section4></a>

### 4. Questions

<a id=section401></a>

__How much money does a movie need to make to be profitable?__

<a id=section5></a>

## 5. Types of Recommender Systems

There are two major approaches to build recommender systems: Content-Based Filtering and Collaborative Filtering:

- __Content-Based Filtering__:
    In content-based filtering, the similarity between different products is calculated on the basis of the attributes of the products. For instance, in a content-based movie recommender system, the similarity between the movies is calculated on the basis of genres, the actors in the movie, the director of the movie, etc.

- __Collaborative Filtering__:
    Collaborative filtering leverages the power of the crowd. The intuition behind collaborative filtering is that if user A likes products X and Y, and another user B likes product X, there is a fair chance that B will like product Y as well.

Take the example of a movie recommender system. Suppose a large number of users have assigned the same ratings to movies X and Y. A new user arrives who has assigned the same rating to movie X but hasn't watched movie Y yet. A collaborative filtering system will recommend movie Y to them.
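
The movie X / movie Y example can be sketched in a few lines. This is only an illustrative item-based variant under stated assumptions: the users, movies and ratings in `toy_ratings` are invented, and the helper names `item_similarity` and `recommend` are hypothetical rather than part of any library.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}. Not taken from any dataset.
toy_ratings = {
    "alice": {"X": 5, "Y": 5, "Z": 1},
    "bob": {"X": 4, "Y": 4, "Z": 2},
    "carol": {"X": 5, "Y": 5},
    "dave": {"X": 5},  # new user: has rated X only
}


def item_similarity(a, b):
    """Cosine similarity between items a and b over users who rated both."""
    common = [u for u, r in toy_ratings.items() if a in r and b in r]
    if not common:
        return 0.0
    dot = sum(toy_ratings[u][a] * toy_ratings[u][b] for u in common)
    norm_a = sqrt(sum(toy_ratings[u][a] ** 2 for u in common))
    norm_b = sqrt(sum(toy_ratings[u][b] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def recommend(user):
    """Rank the items the user has not rated by similarity-weighted rating."""
    seen = toy_ratings[user]
    all_items = {item for r in toy_ratings.values() for item in r}
    scores = {}
    for candidate in all_items - set(seen):
        scores[candidate] = sum(
            item_similarity(candidate, rated) * rating
            for rated, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("dave"))
```

Because every user who rated both X and Y gave them identical ratings, Y ranks ahead of Z for the new user. Cosine similarity on raw ratings still gives Z a fairly high score, which is one reason practical systems often mean-centre the ratings first.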

<a id=section12></a>

# Evaluate the model 

<a id=section1701></a>

### Testing with a custom input

<a id=section18></a>

<a id=section13></a>

# Conclusions

- __Samuel Jackson, aka Nick Fury__ from the Avengers, has __appeared in the most movies__. I initially thought that Morgan Freeman might be the actor with the most movies, but data wins over assumptions.
- Our **Top Director** is **Steven Spielberg**, with almost all of his productions considered a success.
- The most common **genre** is **Drama**, although **Comedy** movies have a remarkable record: **100%** of them are successful.
- **Universal Pictures** is the most prolific company, and also the most successful.
- As we could expect, the **USA** is the most prolific and successful country, with practically no rivals.
- The Top Actor is **Samuel L. Jackson** with 30+ successful movies. Robert De Niro represents an interesting case, dropping from 2nd position, when counting all appearances, to outside the top 5, if we consider appearances in successful movies only.

We started our analysis with this aim: to find the key success factors in the film industry, and to try to use those factors to predict whether a movie will be successful or not.
__We found out some interesting facts, for example:__

  - older movies had lower runtime
  - budget slightly increased across the years
  - longer movies tend to have higher votes
  - higher budget often means higher revenue and popularity
  
Based on the given observations, the __Random Forest classifier appears to be the best classifier__ when considering accuracy and false positive rate (which was arbitrarily selected as the critical measure), followed by the __Logistic Regression and K-Nearest Neighbors__ classifiers.







In [None]:


# load the dataset
rating = pd.read_csv('data/ratings.csv')
books = pd.read_csv('data/books.csv')
book_tags = pd.read_csv('data/book_tags.csv')
tags = pd.read_csv('data/tags.csv')
to_read = pd.read_csv('data/to_read.csv')


In [None]:
tags_join_DF = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how='inner')
tags_join_DF.head()

In [None]:
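from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# min_df=1 keeps every term (recent scikit-learn rejects an integer min_df below 1)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(books['authors'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)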

In [None]:
# Build a 1-dimensional array with book titles
titles = books['title']
indices = pd.Series(books.index, index=books['title'])

# Function that gets book recommendations based on the cosine similarity score of book authors
def authors_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]

In [None]:
authors_recommendations('The Hobbit').head(20)

In [None]:
#checking book_id and the best_book_id
# Use numeric values: mixing the string '0' with np.nan would turn NaN into the string 'nan'
books['check_ids'] = np.where(books['book_id'] == books['best_book_id'], 0.0, np.nan)

In [None]:
books['check_ids'].isnull().values.any()

In [None]:
books.shape

In [None]:
df_book= books[['book_id','books_count','isbn','authors','original_publication_year','title','language_code','average_rating','ratings_count','small_image_url']]

In [None]:
book_tags.rename(columns={'goodreads_book_id':'book_id'}, inplace=True)

In [None]:
book_tags=book_tags.merge(tags,on='tag_id',how='outer')

In [None]:
grouped= book_tags.groupby('book_id')['tag_name'].apply(' '.join)

In [None]:
df_book_new=pd.merge(df_book,grouped.to_frame(), on='book_id', how='inner')

In [None]:
df_book_new.shape

In [None]:
df_book_new.head(1)

In [None]:
#create metadata for similarity using Author,Tags and the language
def create_metadata(x):
    return ''.join(x['authors'])+'  '+''.join(x['tag_name'])+'  '+''.join(str(x['language_code']))

df_book_new['metadata']= df_book_new.apply(create_metadata,axis=1)

In [None]:
df_book_new['metadata']= df_book_new['metadata'].fillna('')

In [None]:
#finding the similarity between two books

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer= TfidfVectorizer(stop_words='english')
vectorizer

In [None]:
df_book_matrix=vectorizer.fit_transform(df_book_new['metadata'])

In [None]:
df_book_matrix.shape

In [None]:
#cosine similarity using linear kernel

from sklearn.metrics.pairwise import linear_kernel

cos_matrix= linear_kernel(df_book_matrix, df_book_matrix)

In [None]:
# 1D array for book title and indices

book_indices= pd.Series(df_book_new.index,index=df_book_new['title'])

In [None]:
def get_recommendations(name,sim):
   # indx=df_book_new.loc[df_book_new['title']==name].index
    indx=book_indices[name]
    sim_scores=list(enumerate(sim[indx]))
    new=sorted(sim_scores,key=lambda x: x[1],reverse=True)
   
    new=new[1:11]
    #print(new)
    book_idx=[x[0] for x in new]
    return (df_book_new['title'].iloc[book_idx])

In [None]:
get_recommendations('The Hunger Games (The Hunger Games, #1)',cos_matrix)
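
To see why a plain dot product (`linear_kernel`) can stand in for cosine similarity here, the sketch below builds TF-IDF vectors from scratch on three made-up metadata strings. It is a simplified illustration, not scikit-learn's exact weighting: the titles, the strings in `toy_docs` and the helper names are all hypothetical. The key point is that each vector is L2-normalised (as `TfidfVectorizer` does by default), so the dot product of two vectors equals their cosine similarity.

```python
import math
from collections import Counter

# Made-up metadata strings (author + tags) standing in for df_book_new['metadata'].
toy_docs = {
    "Book A": "tolkien fantasy adventure",
    "Book B": "tolkien fantasy epic",
    "Book C": "austen romance classic",
}


def tfidf_vectors(texts):
    """Return one L2-normalised TF-IDF dict per text (simplified weighting)."""
    tokenised = [text.split() for text in texts]
    n_docs = len(tokenised)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


toy_titles = list(toy_docs)
toy_vectors = tfidf_vectors(toy_docs.values())


def most_similar(title):
    """Titles ranked by similarity to `title`, excluding the title itself."""
    i = toy_titles.index(title)
    scored = [(dot(toy_vectors[i], toy_vectors[j]), toy_titles[j])
              for j in range(len(toy_titles)) if j != i]
    return [name for score, name in sorted(scored, reverse=True)]


print(most_similar("Book A"))
```

"Book B" shares two terms with "Book A" and therefore ranks first, while "Book C" shares none and scores zero, mirroring how `get_recommendations` orders books by their row in the similarity matrix.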
