## Exploratory data analysis for ivi movies

In [175]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

## First Step

### Getting to know the data 

In [176]:
df = pd.read_csv("data_before.csv",index_col=0)

In [177]:
df.head()

Unnamed: 0,Name,Year,Duration,Age_restriction,Country,Genre,Rating,Number_of_reviews
0,777 Чарли,2022,163,16+,Индия,Драмы,9.3,35.0
1,1+1,2011,112,18+,Франция,Драмы,9.0,390.0
2,Леон,1994,124,16+,Франция,Триллеры,8.9,3.0
3,Зеленая книга,2018,124,18+,США,Комедии,9.1,227.0
4,Бойцовский клуб,1999,133,18+,США,Триллеры,8.6,27.0


In [178]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 0 to 119
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               120 non-null    object 
 1   Year               120 non-null    int64  
 2   Duration           120 non-null    object 
 3   Age_restriction    120 non-null    object 
 4   Country            120 non-null    object 
 5   Genre              120 non-null    object 
 6   Rating             120 non-null    float64
 7   Number_of_reviews  88 non-null     float64
dtypes: float64(2), int64(1), object(5)
memory usage: 8.4+ KB


In [179]:
df.rename(columns={"Duration":"Duration,m"},inplace=True)

### Checking for missing values

In [180]:
df.isna().mean()

Name                 0.000000
Year                 0.000000
Duration,m           0.000000
Age_restriction      0.000000
Country              0.000000
Genre                0.000000
Rating               0.000000
Number_of_reviews    0.266667
dtype: float64

About 27% of the values in the `Number_of_reviews` column are missing (32 of 120 rows); every other column is complete.

We will replace the missing values with 0, which we interpret as the movie having no reviews.

In [181]:
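# Assign the result back instead of calling fillna(inplace=True) on a column accessor
df["Number_of_reviews"] = df["Number_of_reviews"].fillna(0)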

In [182]:
df.isna().mean()

Name                 0.0
Year                 0.0
Duration,m           0.0
Age_restriction      0.0
Country              0.0
Genre                0.0
Rating               0.0
Number_of_reviews    0.0
dtype: float64
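
As a minimal sketch of the missing-value step above, the toy frame below (hypothetical data, not the ivi dataset) shows how `isna().mean()` reports the share of missing values per column and how `fillna(0)` removes them. The names `toy`, `missing_share` and `remaining` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four movies have no review count
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Number_of_reviews": [35.0, np.nan, 3.0, np.nan],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Treat a missing count as "no reviews" and assign the result back
toy["Number_of_reviews"] = toy["Number_of_reviews"].fillna(0)

# Number of missing values left in the column after filling
remaining = int(toy["Number_of_reviews"].isna().sum())
```

Assigning the filled column back to the frame keeps the effect explicit, which may be safer than relying on `inplace=True` through attribute access.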

### Checking for Duplicates

Let's check whether the data contains duplicate rows and delete any we find.

In [183]:
df[df.duplicated()].count()

Name                 30
Year                 30
Duration,m           30
Age_restriction      30
Country              30
Genre                30
Rating               30
Number_of_reviews    30
dtype: int64

In [184]:
df = df.drop_duplicates().reset_index(drop=True)

In [185]:
df.duplicated().sum()

0
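
The duplicate check above can be sketched on a small hypothetical frame (the data and the names `toy`, `n_duplicates` and `deduped` are illustrative assumptions). It shows that `duplicated()` flags only the repeats of an earlier row, so `drop_duplicates()` keeps one copy of each.

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first exactly
toy = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Year": [2011, 2011, 1994],
})

# duplicated() marks every repeat of an earlier row (keep="first" is the default)
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and renumber the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)
```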

### Data types

In [186]:
df.dtypes

Name                  object
Year                   int64
Duration,m            object
Age_restriction       object
Country               object
Genre                 object
Rating               float64
Number_of_reviews    float64
dtype: object

`Duration,m` is stored as `object` and `Number_of_reviews` as `float64`. Let's convert both columns to integers.

In [187]:
df["Duration,m"].head(10)

0        163
1        112
2        124
3        124
4        133
5         92
6        122
7    2 серии
8        100
9        132
Name: Duration,m, dtype: object

The data contains series as well as movies: a series has a duration such as `2 серии` ("2 episodes") instead of a number of minutes. Since we are analyzing the best movies, we will drop the rows that describe series.

In [188]:
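def drop_series(df):
    """Drop, in place, the rows whose duration is not a whole number of minutes."""
    # Non-numeric durations such as "2 серии" mark series rather than movies
    is_movie = df["Duration,m"].astype(str).str.isdigit()
    df.drop(index=df.index[~is_movie], inplace=True)
    # Renumber the rows in place; without inplace=True the result is discarded
    df.reset_index(drop=True, inplace=True)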


In [189]:
drop_series(df)
df["Duration,m"] = df["Duration,m"].astype("int64")

In [191]:
df.Number_of_reviews = df.Number_of_reviews.astype("int64")

In [194]:
df.dtypes

Name                  object
Year                   int64
Duration,m             int64
Age_restriction       object
Country               object
Genre                 object
Rating               float64
Number_of_reviews      int64
dtype: object
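
As an alternative sketch of the series filter and integer conversion above, `pd.to_numeric` with `errors="coerce"` can do the parsing and the filtering in one pass. This is not the approach used in this notebook: the toy data and the names `toy`, `minutes` and `movies` are assumptions, and unlike `str.isdigit`, `to_numeric` would also accept values such as `"1.5"`.

```python
import pandas as pd

# Hypothetical toy data: one entry is a series ("2 серии" means "2 episodes")
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Duration,m": ["163", "2 серии", "112"],
})

# Values that cannot be parsed as numbers become NaN
minutes = pd.to_numeric(toy["Duration,m"], errors="coerce")

# Replace the column, drop the unparsable rows, and store minutes as int64
movies = (
    toy.assign(**{"Duration,m": minutes})
       .dropna(subset=["Duration,m"])
       .astype({"Duration,m": "int64"})
       .reset_index(drop=True)
)
```

Building a new frame rather than mutating `toy` leaves the raw data available for inspection if the filter removes more rows than expected.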

In [193]:
df.head()

Unnamed: 0,Name,Year,"Duration,m",Age_restriction,Country,Genre,Rating,Number_of_reviews
0,777 Чарли,2022,163,16+,Индия,Драмы,9.3,35
1,1+1,2011,112,18+,Франция,Драмы,9.0,390
2,Леон,1994,124,16+,Франция,Триллеры,8.9,3
3,Зеленая книга,2018,124,18+,США,Комедии,9.1,227
4,Бойцовский клуб,1999,133,18+,США,Триллеры,8.6,27


## Second Step

### Visualization for Data Analysis