# Decision Tree

In [1]:
import pandas as pd

## Phase0: Data Analysis


### Loading Data

In [2]:
data=pd.read_csv("dataset.csv")
data

Unnamed: 0,type,title,cast,country,release_year,listed_in,description
0,Movie,Dick Johnson Is Dead,,United States,2020,Documentaries,"As her father nears the end of his life, filmm..."
1,TV Show,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,TV Show,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,TV Show,Jailbirds New Orleans,,,2021,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,TV Show,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...
11054,TV Show,X-Men: Evolution,"Noel Fisher, Vincent Gale, Christopher Judge, ...",United States,2000,"Action-Adventure, Animation, Kids",X-Men: Evolution features the team as teenager...
11055,TV Show,Smart Guy,"Tahj Mowry, John Jones, Jason Weaver, Essence ...",United States,1996,"Comedy, Coming of Age, Kids",A genius tries to fit in as a high school soph...
11056,TV Show,Disney Kirby Buckets,"Jacob Bertrand, Mekai Curtis, Cade Sutton, Oli...",United States,2014,"Action-Adventure, Comedy, Coming of Age",Welcome to Kirby's world! It's rude and sketchy.
11057,TV Show,Disney Mech-X4,"Nathaniel Potvin, Raymond Cham, Kamran Lucas, ...",Canada,2016,"Action-Adventure, Comedy, Science Fiction",Ryan discovers his ability to control a giant ...


### Data Information

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11059 entries, 0 to 11058
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          11059 non-null  object
 1   title         11059 non-null  object
 2   cast          9694 non-null   object
 3   country       8364 non-null   object
 4   release_year  11059 non-null  int64 
 5   listed_in     11059 non-null  object
 6   description   11059 non-null  object
dtypes: int64(1), object(6)
memory usage: 604.9+ KB


In [4]:
data.describe()

Unnamed: 0,release_year
count,11059.0
mean,2014.209603
std,8.959517
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


### Missing Data

In [5]:
data.isnull().sum(axis = 0)

type               0
title              0
cast            1365
country         2695
release_year       0
listed_in          0
description        0
dtype: int64

## Phase1: Preprocess

### Dealing With Missing Values
There are several ways to deal with missing values; two of them are discussed here:
 - Deleting data: we may remove rows with missing data if the number of such rows is insignificant, or if the missing values cannot easily be estimated.
 - Filling data: we may fill the missing values with an estimate. This can be a predetermined default value or a statistic such as the mean, median, or mode. The downside of this method is that it reduces variance and distorts the correlation with other columns. We may also use more complex methods, such as ML algorithms, to estimate missing data; this can mitigate the problems mentioned earlier, but the result depends heavily on the method.

In this case we chose to replace the missing values of the "cast" column with an empty string and the missing values of the "country" column with the mode of that column. The reason is that
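
The two strategies described above can be compared on a small example. This is a minimal sketch on a hypothetical toy frame (the `toy` values are made up for illustration and are not taken from `dataset.csv`); it only shows how `dropna` and `fillna` change the number of rows and the remaining null count.

```python
import pandas as pd

# Hypothetical toy frame with the same two incomplete columns as the dataset.
toy = pd.DataFrame({
    "cast": ["A, B", None, "C", None],
    "country": ["India", "United States", None, "United States"],
})

# Option 1: delete every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill with a default value ("cast") or with the mode ("country").
filled = toy.copy()
filled["cast"] = filled["cast"].fillna("")
filled["country"] = filled["country"].fillna(filled["country"].mode()[0])

print(len(toy), len(dropped), len(filled))
print(filled.isnull().sum().sum())
```

Deleting keeps only the fully populated rows, while filling keeps every row at the cost of introducing estimated values.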

In [6]:
data['country'] = data['country'].fillna(data['country'].mode()[0])
data['cast'] = data['cast'].fillna("")
data.isnull().sum(axis = 0)

type            0
title           0
cast            0
country         0
release_year    0
listed_in       0
description     0
dtype: int64

### Normalization and Standardization

In [7]:
data['release_year']=(data['release_year']-data['release_year'].mean())/data['release_year'].std()
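
The cell above applies standardization (a z-score) to `release_year`. As a hedged illustration of how this differs from min-max normalization, the sketch below applies both to a hypothetical three-value series; the values are chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
years = pd.Series([2000, 2010, 2020], name="release_year")

# Standardization (z-score): zero mean, unit standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
standardized = (years - years.mean()) / years.std()

# Min-max normalization: rescale the values to the [0, 1] interval.
normalized = (years - years.min()) / (years.max() - years.min())

print(standardized.tolist())
print(normalized.tolist())
```

Standardization keeps the relative distances between values and is not bounded, whereas min-max normalization always maps the smallest value to 0 and the largest to 1.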

In [8]:
data.head()

Unnamed: 0,type,title,cast,country,release_year,listed_in,description
0,Movie,Dick Johnson Is Dead,,United States,0.646285,Documentaries,"As her father nears the end of his life, filmm..."
1,TV Show,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,0.757898,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,TV Show,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,0.757898,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,TV Show,Jailbirds New Orleans,,United States,0.757898,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,TV Show,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,0.757898,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


### Categorical Data

In [9]:
labels = data['type'].astype('category').cat.categories.tolist()
labels

['Movie', 'TV Show']

In [10]:
# Map each category label to an integer code starting at 1: {'Movie': 1, 'TV Show': 2}
replace_map = {'type': {label: code for code, label in enumerate(labels, start=1)}}
data.replace(replace_map, inplace=True)
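
As an alternative sketch (not the approach used in the cell above), the same 1-based integer encoding can be obtained from the pandas categorical codes directly. The `types` series here is a hypothetical stand-in for the `type` column.

```python
import pandas as pd

# Hypothetical stand-in for the 'type' column.
types = pd.Series(["Movie", "TV Show", "TV Show", "Movie"], name="type")

# Categories are sorted, so 'Movie' -> 0 and 'TV Show' -> 1; add 1 to start at 1.
codes = types.astype("category").cat.codes + 1

print(codes.tolist())
```

Both approaches rely on the sorted order of the category labels, so they should agree as long as the set of labels is the same.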

In [11]:
data.head()
#replace_map

Unnamed: 0,type,title,cast,country,release_year,listed_in,description
0,1,Dick Johnson Is Dead,,United States,0.646285,Documentaries,"As her father nears the end of his life, filmm..."
1,2,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,0.757898,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,2,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,0.757898,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,2,Jailbirds New Orleans,,United States,0.757898,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,2,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,0.757898,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


### Listed Data

In [13]:
# Split each comma-separated genre string into a list in one vectorized step;
# this avoids the chained assignment that triggers SettingWithCopyWarning.
data["listed_in"] = data["listed_in"].str.split(",")
data.head()


Unnamed: 0,type,title,cast,country,release_year,listed_in,description
0,1,Dick Johnson Is Dead,,United States,0.646285,[Documentaries],"As her father nears the end of his life, filmm..."
1,2,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,0.757898,"[International TV Shows, TV Dramas, TV Myste...","After crossing paths at a party, a Cape Town t..."
2,2,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,0.757898,"[Crime TV Shows, International TV Shows, TV ...",To protect his family from a powerful drug lor...
3,2,Jailbirds New Orleans,,United States,0.757898,"[Docuseries, Reality TV]","Feuds, flirtations and toilet talk go down amo..."
4,2,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,0.757898,"[International TV Shows, Romantic TV Shows, ...",In a city of coaching centers known to train I...


### Count Vectorizer

## Phase2: Decision Tree

## Phase3: Random Forest