# Exploratory Data Analysis (EDA)

EDA is one of the most important first steps when dealing with new data. It helps us understand the data and decide which tools to use in later steps.
![EDA](https://i.postimg.cc/dQ6tFJV8/EDA.jpg)

## Setting up our DataFrame
We'll be exploring the IMDB Movies dataset from Kaggle!

### Closer Look at the Variables
The dataset contains 1000 samples and 16 variables:

*    **Poster_Link** - Link to the poster that IMDb uses
*    **Series_Title** - Name of the movie
*    **Released_Year** - Year in which the movie was released
*    **Certificate** - Certificate earned by the movie
*    **Runtime** - Total runtime of the movie
*    **Genre** - Genre of the movie
*    **IMDB_Rating** - Rating of the movie on the IMDb site
*    **Overview** - Short summary of the story
*    **Meta_score** - Score earned by the movie
*    **Director** - Name of the director
*    **Star1, Star2, Star3, Star4** - Names of the stars
*    **No_of_Votes** - Total number of votes
*    **Gross** - Money earned by the movie


In [1]:
##
import warnings 
warnings.filterwarnings('ignore')

In [47]:
import pandas as pd
import numpy as np

In [48]:
pd.set_option('display.max_columns', 85)

In [49]:
df = pd.read_csv("./dataset/imdb_top_1000.csv")

## Step 1: Understand the data
An introductory step is to look at the content of the data to get an idea of what you're going to be dealing with.
We can gain insight on the data with the following commands:

* **df.shape** (number of rows, number of columns)
* **df.columns** (column titles)
* **df.head()** (first 5 rows by default, in table format)
* **df.nunique()** (count of unique values for each variable/dimension)
* **df.isnull().sum()** (number of missing values in each column)
* **df.describe()** (summary statistics for the numeric columns of the DataFrame)
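Several of these checks can also be combined into a single overview table. The sketch below uses a small made-up DataFrame (the `sample` frame and its values are assumptions for illustration, not rows from the IMDB file); the same pattern should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": [9.3, 9.2, np.nan],
})

# One row per column: its dtype, how many values are missing, how many are distinct
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isnull().sum(),
    "unique": sample.nunique(),  # NaN is not counted as a distinct value
})
print(summary)
```

Reading the three columns side by side makes it easier to spot columns that are both sparse and high-cardinality before deciding how to clean them.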

In [50]:
shape = df.shape
print(f"The shape of the dataset is : {shape[0]} rows and {shape[1]} columns")

The shape of the dataset is : 1000 rows and 16 columns


In [6]:
df.columns

Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Certificate',
       'Runtime', 'Genre', 'IMDB_Rating', 'Overview', 'Meta_score', 'Director',
       'Star1', 'Star2', 'Star3', 'Star4', 'No_of_Votes', 'Gross'],
      dtype='object')

In [7]:
df.head(3)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444


In [8]:
df = df.rename(columns={"Series_Title": "Movies_Title"})
df.head(3)

Unnamed: 0,Poster_Link,Movies_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444


In [9]:
df.nunique()

Poster_Link      1000
Movies_Title      999
Released_Year     100
Certificate        16
Runtime           140
Genre             202
IMDB_Rating        17
Overview         1000
Meta_score         63
Director          548
Star1             660
Star2             841
Star3             891
Star4             939
No_of_Votes       999
Gross             823
dtype: int64

In [10]:
df.isnull().sum()

Poster_Link        0
Movies_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

In [11]:
df.describe()

Unnamed: 0,IMDB_Rating,Meta_score,No_of_Votes
count,1000.0,843.0,1000.0
mean,7.9493,77.97153,273692.9
std,0.275491,12.376099,327372.7
min,7.6,28.0,25088.0
25%,7.7,70.0,55526.25
50%,7.9,79.0,138548.5
75%,8.1,87.0,374161.2
max,9.3,100.0,2343110.0


## Step 2: Data Cleaning
Data cleaning not only improves data integrity but also makes the data more usable and easier for humans to understand.

**Broad Overview of Cleaning Steps**

* **Handle empty values:** you can delete them or fill them with a value that makes sense
* **Check for data consistency:** letter case, formatting, and similar details may matter for strings
* **Handle outliers**
* **Remove duplicates**
* **Validate correctness of entries:** age columns shouldn't contain text, for instance
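As a rough illustration of how these steps fit together, here is a minimal sketch on a made-up DataFrame. The `toy` frame, its column names, and the choice of a median fill are all assumptions for this example; outlier handling is left out to keep it short.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB file
toy = pd.DataFrame({
    "title": ["Alpha", "alpha ", "Beta", "Gamma", "Gamma"],
    "rating": [8.1, 8.1, None, 7.9, 7.9],
    "votes": ["1,200", "1,200", "950", "abc", "abc"],
})

# Check consistency: normalise case and surrounding whitespace
toy["title"] = toy["title"].str.strip().str.title()

# Validate entries: non-numeric vote counts become NaN instead of raising
toy["votes"] = pd.to_numeric(
    toy["votes"].str.replace(",", "", regex=False), errors="coerce"
)

# Handle empty values: fill the missing rating with the column median
toy["rating"] = toy["rating"].fillna(toy["rating"].median())

# Remove duplicates (rows that are identical in every column)
toy_clean = toy.drop_duplicates().reset_index(drop=True)
print(toy_clean)
```

The order matters here: normalising the titles first is what makes the two "Alpha" rows identical, so the duplicate check can catch them.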

![cleaning](https://i.postimg.cc/5y2JyF28/cleaning-data.png)

### Check for data consistency

In [12]:
df.head(1)

Unnamed: 0,Poster_Link,Movies_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469


In [13]:
df['Gross'] = df['Gross'].str.replace(',', '')
df.head(1)

Unnamed: 0,Poster_Link,Movies_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469


In [14]:
# Assign the result back instead of using inplace=True on a column selection
df["Gross"] = df["Gross"].fillna(0)

In [15]:
df['Gross'] = df['Gross'].astype(int)

In [16]:
df["Gross"]

0       28341469
1      134966411
2      534858444
3       57300000
4        4360000
         ...    
995            0
996            0
997     30500000
998            0
999            0
Name: Gross, Length: 1000, dtype: int64
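Filling missing `Gross` values with 0 makes the integer cast possible, but it also makes "unknown" look like "earned nothing". One possible alternative, sketched here on a made-up Series (the `gross_raw` values are assumptions), is to convert with `pd.to_numeric` and use pandas' nullable `Int64` dtype so that missing entries stay missing.

```python
import pandas as pd

# Hypothetical raw values in the same comma-separated style as the Gross column
gross_raw = pd.Series(["28,341,469", None, "57,300,000"], name="Gross")

# Strip the thousands separators, parse to numbers, keep gaps as <NA>
gross = pd.to_numeric(
    gross_raw.str.replace(",", "", regex=False), errors="coerce"
).astype("Int64")

print(gross)
print(gross.isna().sum())  # number of entries that are still missing
```

Aggregations such as `sum()` and `mean()` skip the missing entries by default, so they are not pulled toward zero.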

### Handle empty values
* The **fillna()** function fills NA/NaN values with a specified value or method.
* The **dropna()** function removes rows or columns that contain Null/NaN values.
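The choice between these options affects later statistics. The sketch below compares three strategies on a small made-up column (the values are assumptions, only the column name mirrors the dataset): filling with 0, filling with the median, and dropping the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries
scores = pd.DataFrame({"Meta_score": [80.0, np.nan, 100.0, np.nan, 90.0]})

filled_zero = scores["Meta_score"].fillna(0)  # placeholder value
filled_median = scores["Meta_score"].fillna(scores["Meta_score"].median())
dropped = scores.dropna(subset=["Meta_score"])  # remove incomplete rows

print(filled_zero.mean())    # 54.0
print(filled_median.mean())  # 90.0
print(len(dropped))          # 3
```

A zero fill is convenient, but it drags the mean well below any observed score, so it is worth keeping in mind when interpreting summary statistics computed afterwards.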

In [17]:
df.isnull().sum()

Poster_Link        0
Movies_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross              0
dtype: int64

In [18]:
df[df['Certificate'].isnull()]

Unnamed: 0,Poster_Link,Movies_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
30,https://m.media-amazon.com/images/M/MV5BYjBmYT...,Seppuku,1962,,133 min,"Action, Drama, Mystery",8.6,When a ronin requesting seppuku at a feudal lo...,85.0,Masaki Kobayashi,Tatsuya Nakadai,Akira Ishihama,Shima Iwashita,Tetsurô Tanba,42004,0
54,https://m.media-amazon.com/images/M/MV5BNWJhMD...,Ayla: The Daughter of War,2017,,125 min,"Biography, Drama, History",8.4,"In 1950, amid-st the ravages of the Korean War...",,Can Ulkay,Erdem Can,Çetin Tekindor,Ismail Hacioglu,Kyung-jin Lee,34112,0
77,https://m.media-amazon.com/images/M/MV5BOTI4NT...,Tengoku to jigoku,1963,,143 min,"Crime, Drama, Mystery",8.4,An executive of a shoe company becomes a victi...,,Akira Kurosawa,Toshirô Mifune,Yutaka Sada,Tatsuya Nakadai,Kyôko Kagawa,34357,0
92,https://m.media-amazon.com/images/M/MV5BNjAzMz...,Babam ve Oglum,2005,,112 min,"Drama, Family",8.3,The family of a left-wing journalist is torn a...,,Çagan Irmak,Çetin Tekindor,Fikret Kuskan,Hümeyra,Ege Tanman,78925,0
121,https://m.media-amazon.com/images/M/MV5BZmM0NG...,Ikiru,1952,,143 min,Drama,8.3,A bureaucrat tries to find a meaning in his li...,,Akira Kurosawa,Takashi Shimura,Nobuo Kaneko,Shin'ichi Himori,Haruo Tanaka,68463,55240
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
920,https://m.media-amazon.com/images/M/MV5BMjEzMj...,The Secret of Kells,2009,,71 min,"Animation, Adventure, Family",7.6,A young boy in a remote medieval outpost under...,81.0,Tomm Moore,Nora Twomey,Evan McGuire,Brendan Gleeson,Mick Lally,31779,686383
926,https://m.media-amazon.com/images/M/MV5BMTI5Mz...,Dead Man's Shoes,2004,,90 min,"Crime, Drama, Thriller",7.6,A disaffected soldier returns to his hometown ...,52.0,Shane Meadows,Paddy Considine,Gary Stretch,Toby Kebbell,Stuart Wolfenden,49728,6013
944,https://m.media-amazon.com/images/M/MV5BMDc2MG...,Batoru rowaiaru,2000,,114 min,"Action, Adventure, Drama",7.6,"In the future, the Japanese government capture...",81.0,Kinji Fukasaku,Tatsuya Fujiwara,Aki Maeda,Tarô Yamamoto,Takeshi Kitano,169091,0
998,https://m.media-amazon.com/images/M/MV5BZTBmMj...,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,0



<table><tr>
<td> <img src="https://i.postimg.cc/yd6tyCCH/fillna1.png" alt="Drawing" style="width: 290px;"/> </td>
<td> <img src="https://i.postimg.cc/SQ1FVRKt/fillna2.png" alt="Drawing" style="width: 290px;"/> </td>
</tr></table>

In [19]:
df["Certificate"] = df["Certificate"].fillna("Not Rated")
df["Meta_score"] = df["Meta_score"].fillna(0) 

In [20]:
df.isnull().sum()

Poster_Link      0
Movies_Title     0
Released_Year    0
Certificate      0
Runtime          0
Genre            0
IMDB_Rating      0
Overview         0
Meta_score       0
Director         0
Star1            0
Star2            0
Star3            0
Star4            0
No_of_Votes      0
Gross            0
dtype: int64

* Drop the `Poster_Link` column because it is not useful in our case.

In [21]:
df.drop('Poster_Link', axis=1, inplace=True)

In [22]:
df.head()

Unnamed: 0,Movies_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


### Check for duplicate data
* The **duplicated()** method returns a boolean Series that marks each row of the DataFrame as a duplicate (True) or not (False).
* The **drop_duplicates()** method returns a DataFrame with the duplicate rows removed.
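Both methods accept `keep` and `subset` arguments that change what counts as a duplicate. The following sketch uses made-up titles (the `movies` frame is an assumption for illustration) to show the difference.

```python
import pandas as pd

# Hypothetical data: one title reused for two different films, one exact repeat
movies = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B", "Film B"],
    "year": [2013, 2015, 1944, 1944],
})

# Default: a row is a duplicate only if every column matches an earlier row
print(movies.duplicated().tolist())                  # [False, False, False, True]

# keep=False marks every member of a duplicate group, not just the repeats
print(movies.duplicated(keep=False).tolist())        # [False, False, True, True]

# subset compares only the listed columns
print(movies.duplicated(subset=["title"]).tolist())  # [False, True, False, True]

deduped = movies.drop_duplicates(ignore_index=True)
print(deduped)
```

The `nunique()` output earlier shows 999 unique titles for 1000 rows, so comparing on the title alone would flag one film that shares a title with another, which may or may not be a true duplicate.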

In [56]:
# Append a copy of the last row so there is a duplicate to detect
df.loc[df.shape[0], :] = df.iloc[-1]

In [57]:
df

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110.0,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367.0,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952.0,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845.0,4360000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,https://m.media-amazon.com/images/M/MV5BODk3Yj...,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075.0,
997,https://m.media-amazon.com/images/M/MV5BM2U3Yz...,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374.0,30500000
998,https://m.media-amazon.com/images/M/MV5BZTBmMj...,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471.0,
999,https://m.media-amazon.com/images/M/MV5BMTY5OD...,The 39 Steps,1935,,86 min,"Crime, Mystery, Thriller",7.6,A man in London tries to help a counter-espion...,93.0,Alfred Hitchcock,Robert Donat,Madeleine Carroll,Lucie Mannheim,Godfrey Tearle,51853.0,


In [24]:
df.tail(3)

Unnamed: 0,Movies_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
998,Lifeboat,1944,Not Rated,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471.0,0.0
999,The 39 Steps,1935,Not Rated,86 min,"Crime, Mystery, Thriller",7.6,A man in London tries to help a counter-espion...,93.0,Alfred Hitchcock,Robert Donat,Madeleine Carroll,Lucie Mannheim,Godfrey Tearle,51853.0,0.0
1000,The 39 Steps,1935,Not Rated,86 min,"Crime, Mystery, Thriller",7.6,A man in London tries to help a counter-espion...,93.0,Alfred Hitchcock,Robert Donat,Madeleine Carroll,Lucie Mannheim,Godfrey Tearle,51853.0,0.0


In [59]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000     True
Length: 1001, dtype: bool

In [58]:
n_duplicates = df.duplicated().sum()
print('Number of duplicated rows: {}'.format(n_duplicates))
if n_duplicates > 0:
    duplicated_df = df[df.duplicated()]
    display(duplicated_df)
    print()
    display(df.drop_duplicates())

Number of duplicated rows: 1


Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
1000,https://m.media-amazon.com/images/M/MV5BMTY5OD...,The 39 Steps,1935,,86 min,"Crime, Mystery, Thriller",7.6,A man in London tries to help a counter-espion...,93.0,Alfred Hitchcock,Robert Donat,Madeleine Carroll,Lucie Mannheim,Godfrey Tearle,51853.0,





Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110.0,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367.0,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952.0,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845.0,4360000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,https://m.media-amazon.com/images/M/MV5BNGEwMT...,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544.0,
996,https://m.media-amazon.com/images/M/MV5BODk3Yj...,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075.0,
997,https://m.media-amazon.com/images/M/MV5BM2U3Yz...,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374.0,30500000
998,https://m.media-amazon.com/images/M/MV5BZTBmMj...,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471.0,


* Now let's remove `min` from the `Runtime` column so that we have `142` instead of `142 min`.

In [61]:
df['Runtime']

0       142 min
1       175 min
2       152 min
3       202 min
4        96 min
         ...   
996     201 min
997     118 min
998      97 min
999      86 min
1000     86 min
Name: Runtime, Length: 1001, dtype: object

In [63]:
df["duration"] = df["Runtime"].apply(lambda x: int(x.split(" ")[0]))
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,duration
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110.0,28341469,142
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367.0,134966411,175
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444,152
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952.0,57300000,202
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845.0,4360000,96
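The `apply` with a lambda works, but it calls a Python function once per row and raises if a value is missing. A vectorized alternative, sketched on a made-up Series (the `runtime` values are assumptions in the same format as the column), uses the `.str` accessor.

```python
import pandas as pd

# Hypothetical values in the same "<number> min" format
runtime = pd.Series(["142 min", "175 min", "96 min"], name="Runtime")

# Pull out the leading digits and convert them to numbers;
# errors="coerce" would turn anything unparseable into NaN
duration = pd.to_numeric(
    runtime.str.extract(r"(\d+)", expand=False), errors="coerce"
)
print(duration.tolist())  # [142, 175, 96]
```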


## Data Visualization

### Plotly
Plotly's Python graphing library creates interactive, publication-ready graphs. Within Plotly, there is Plotly Express, a high-level API designed to be as consistent and easy to learn as possible.
![plotly](https://i.postimg.cc/Dy4DgcH8/plotly.png)

In [64]:
import plotly.express as px

In [65]:
data = pd.read_csv('./dataset/train.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


<div>
 <img src = "https://i.postimg.cc/fLbq07YW/titanic.jpg" width="300 px" />
</div>

### Scatter plot

In [66]:
data["Survived"] = data["Survived"].astype(str)

In [71]:
fig = px.scatter(data, x='Fare', y='Age', color='Fare', symbol='Survived')
fig.show()
fig.write_html('test.html')

### Pie Plot

In [31]:
fig = px.pie(data, names='Survived', title='Passenger Survival')
fig.show()

### Histogram

In [32]:
fig = px.histogram(data, x='Age', nbins=30, marginal='box')
fig.show()

### Box Plot

In [72]:
fig = px.box(data, x= 'Pclass', y= 'Age')
fig.show()

>**❗ NOTE:**  If you want to see the underlying data, you just have to pass 'all' to the points parameter like this:

In [34]:
fig = px.box(data, x='Pclass', y="Age", points="all")
fig.show()

In [35]:
fig = px.box(data, x='Pclass', y="Age", color="Survived")
fig.show()
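Box plots conventionally draw points beyond 1.5 times the interquartile range (IQR) from the box as individual outliers. The same rule can be applied numerically, which ties back to the "Handle outliers" cleaning step. This is a minimal sketch on made-up ages (the values and the 1.5 multiplier are assumptions for illustration), not necessarily the exact computation Plotly performs.

```python
import pandas as pd

# Hypothetical ages, not taken from the Titanic file
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 80], name="Age")

# Quartiles and interquartile range
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences: anything outside them is flagged as a potential outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())  # [80]
```

Whether a flagged value should be removed, capped, or kept depends on the data: an 80-year-old passenger is unusual but perfectly valid.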

### Heatmap

In [36]:
fig = px.density_heatmap(data, x="Embarked", y="Pclass",
                        height=500, width=500)
fig.show()

We see that the most frequent pairing of `Embarked` and `Pclass` is (`S`, 3): among all port and class combinations, third-class passengers who embarked at Southampton form the largest group.
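The counts behind a density heatmap can be checked with a plain cross-tabulation. The sketch below uses a handful of made-up passengers (the `passengers` frame is an assumption, only the column names mirror the Titanic data).

```python
import pandas as pd

# Hypothetical passengers, not rows from train.csv
passengers = pd.DataFrame({
    "Embarked": ["S", "S", "C", "S", "Q", "C", "S"],
    "Pclass":   [3, 3, 1, 1, 3, 1, 3],
})

# Table of counts: one row per class, one column per port
counts = pd.crosstab(passengers["Pclass"], passengers["Embarked"])
print(counts)

# The most frequent (Embarked, Pclass) pairing
pair_counts = passengers.groupby(["Embarked", "Pclass"]).size()
print(pair_counts.idxmax())  # ('S', 3)
```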

>**❗ NOTE:** You can also easily get a correlation heatmap by calling `px.imshow` on **df.corr()** like so:

In [73]:
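# Survived was cast to str above, so restrict the correlation to numeric columns
data.select_dtypes(include="number").corr()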

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Pclass,-0.035144,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,-0.5495,0.096067,0.159651,0.216225,1.0


In [37]:
fig = px.imshow(data.select_dtypes(include="number").corr(),
                title='Correlations Among Training Features',
                height=700, width=700)
fig.show()


### Why is looking at the correlation of features important?
How does correlation help in feature selection? Two features that are highly correlated with each other are close to linearly dependent, so they carry almost the same information about the dependent variable. When two features are highly correlated, we can usually drop one of them.
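One common way to act on this is to scan the upper triangle of the absolute correlation matrix and drop one feature from each highly correlated pair. The sketch below uses synthetic features, and the 0.9 threshold is an arbitrary assumption; it is not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic features (assumption: not from the Titanic data)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                       # independent of "a"
})

corr = features.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)                # ['b']
print(list(reduced.columns))  # ['a', 'c']
```

Correlation only measures linear relationships, and which member of a pair to keep is a judgment call, so treat the result as a starting point for feature selection.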

### 3-D Scatter Plot

In [39]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [40]:
fig = px.scatter_3d(data, x='Pclass', y='Fare', z='Age',
              color='Survived')
fig.show()

### More on plotly
`Dash` was made by Plotly's creators as a way to easily implement a web interface and create dashboards with Plotly without having to learn JavaScript, HTML, and other web technologies. With Dash you don't make visualizations; you build an interface to display Plotly's visualizations.  
Some interesting applications built with it are:  
* **Dash brain viewer:** https://dash.gallery/dash-brain-viewer/  
* **Transport routes analysis:** https://dash.gallery/transport-routes-analysis/

## Exploratory Data Analysis Tools (2022)
Imagine being able to generate insights from your data within minutes, where writing the equivalent R or Python code by hand could take a data scientist over an hour. That is the appeal of automated exploratory data analysis tools.

Five popular Python EDA tools are DataPrep, Pandas-profiling, SweetViz, Lux, and D-Tale. We will focus on D-Tale in this lecture.  
**D-Tale** is a graphical interface in which we can select the data we want to analyze and choose how to analyze it using different graphs and plots.   
![EDA](https://i.postimg.cc/SQYFTqzS/EDA2.png)

In [41]:
# !pip install dtale

In [45]:
data = data.fillna(0)

In [46]:
import dtale
d = dtale.show(data)
d

