# Introduction

This project continues the analysis of the [most-watched](https://fmovies.to/most-watched) section of the fmovies website.
This is the second part. In part one we crawled the website and extracted the information. In part two we tidy and clean the data for the analysis in part three.

In [58]:
import pandas as pd
import numpy as np

In [59]:
movie_df = pd.read_csv('../Data/final_movies_df.csv')
tv_df = pd.read_csv('../Data/final_tvs_df.csv')

In [60]:
print(movie_df.columns)
print(tv_df.columns)

Index(['movie_name', 'watch_link', 'date_added', 'site_rank', 'Genre', 'Stars',
       'IMDb', 'Director', 'Release', 'Country', 'Rating'],
      dtype='object')
Index(['tv_name', 'watch_link', 'season', 'episodes', 'date_added',
       'site_rank', 'Genre', 'Stars', 'IMDb', 'Director', 'Release', 'Country',
       'Rating'],
      dtype='object')


In [61]:
movie_df.head()

| | movie_name | watch_link | date_added | site_rank | Genre | Stars | IMDb | Director | Release | Country | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avengers: Endgame | https://fmovies.to/film/avengers-endgame.xjm5v | 2020-07-14 | 1 | Sci-Fi,Adventure,Action,Fantasy | Don Cheadle,Anthony Mackie,Rene Russo,Mark Ruf... | 8.4 | Anthony Russo,Joe Russo | 2019-04-22 | United States | 7.1/12,640 times |
| 1 | Avengers: Infinity War | https://fmovies.to/film/avengers-infinity-war.... | 2020-07-14 | 3 | Sci-Fi,Adventure,Action,Fantasy | William Hurt,Don Cheadle,Anthony Mackie,Benici... | 8.5 | Anthony Russo,Joe Russo | 2018-04-23 | United States | 6.2/21,830 times |
| 2 | Aquaman | https://fmovies.to/film/aquaman.qk91j | 2020-07-14 | 4 | Sci-Fi,Adventure,Action,Fantasy | Nicole Kidman,Michael Beach,Patrick Wilson,Jul... | 7.0 | James Wan | 2018-11-26 | United States,Australia | 6.7/8,924 times |
| 3 | Spider-Man: Far from Home | https://fmovies.to/film/spider-man-far-from-ho... | 2020-07-14 | 5 | Sci-Fi,Adventure,Action | Samuel L Jackson,Marisa Tomei,Zendaya,Jake Gyl... | 7.5 | Jon Watts | 2019-06-26 | United States | 6.1/6,514 times |
| 4 | Aladdin | https://fmovies.to/film/aladdin.z10p2 | 2020-07-14 | 6 | Comedy,Adventure,Romance,Fantasy,Family | Will Smith,Navid Negahban,Numan Acar,Stefan Ka... | 7.0 | Guy Ritchie | 2019-05-08 | United States | 6.7/4,979 times |


# Columns

- 'movie_name' / 'tv_name': Name of the movie or TV show
- 'watch_link': URL of the page to watch the movie or TV show
- 'date_added': Date the row was added to the DataFrame, not the date it was added to fmovies
- 'site_rank': Rank on fmovies by most watched, starting from 1
- 'Genre': Genres
- 'Stars': Cast
- 'IMDb': IMDb rating
- 'Director': Director
- 'Release': Release date of the movie or TV show
- 'Country': Country of origin; can be more than one
- 'Rating': Average viewer rating on the fmovies.to website
- 'season': Season number, TV shows only
- 'episodes': Number of episodes available, TV shows only


## Rename Columns to Uppercase

In [62]:
movie_df.columns = movie_df.columns.str.upper().tolist()
tv_df.columns = tv_df.columns.str.upper().tolist()

In [63]:
tv_df.head(2)

| | TV_NAME | WATCH_LINK | SEASON | EPISODES | DATE_ADDED | SITE_RANK | GENRE | STARS | IMDB | DIRECTOR | RELEASE | COUNTRY | RATING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Game of Thrones | https://fmovies.to/film/game-of-thrones.3yl2 | 8 | 6 | 2020-07-14 | 2 | Drama,Adventure,Fantasy | Peter Dinklage,Kit Harington,Emilia Clarke | 9.3 | David Benioff,D.b. Weiss | 2011-04-17 | United States,United Kingdom | 6.3/20,443 times |
| 1 | The Big Bang Theory | https://fmovies.to/film/the-big-bang-theory.n5x8 | 12 | 23 | 2020-07-14 | 16 | Comedy,Romance | Bob Newhart,Sara Gilbert,Kaley Cuoco,Wil Wheat... | 8.1 | Mark Cendrowski,Anthony Rich,Peter Chakos,Nico... | 2006-05-01 | United States | 6.2/5,406 times |


In [64]:
movie_df.head(2)

| | MOVIE_NAME | WATCH_LINK | DATE_ADDED | SITE_RANK | GENRE | STARS | IMDB | DIRECTOR | RELEASE | COUNTRY | RATING |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avengers: Endgame | https://fmovies.to/film/avengers-endgame.xjm5v | 2020-07-14 | 1 | Sci-Fi,Adventure,Action,Fantasy | Don Cheadle,Anthony Mackie,Rene Russo,Mark Ruf... | 8.4 | Anthony Russo,Joe Russo | 2019-04-22 | United States | 7.1/12,640 times |
| 1 | Avengers: Infinity War | https://fmovies.to/film/avengers-infinity-war.... | 2020-07-14 | 3 | Sci-Fi,Adventure,Action,Fantasy | William Hurt,Don Cheadle,Anthony Mackie,Benici... | 8.5 | Anthony Russo,Joe Russo | 2018-04-23 | United States | 6.2/21,830 times |


# Tidying

1. The genre column holds a list of values in one row; let's make it one value per row.
2. The release date can be converted to datetime and then used as the index of the df.
3. The rating column holds two values: the first is the site rating and the second is the number of reviews by viewers. Let's separate them into different columns.

## Genre Split and Date Column

Let's write a function that splits the genres and stacks them into multiple rows, like [this](https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows/21032532). In addition, let's set the index to the release date.

In [65]:
def split_genre(df):
    """Return a copy of df with one genre per row, indexed by release date."""
    cp = df.copy()

    # Split the genres on "," into columns, then stack them so each genre gets its own row.
    # The result is a Series of genres; dropna removes the padding left for rows
    # that have fewer genres than the widest row.
    genre = cp["GENRE"].str.split(",", expand=True).stack().dropna()

    # Drop the inner index level added by stack, keeping the original row label
    genre.index = genre.index.droplevel(-1)

    # Name the Series so it becomes the GENRE column after the join
    genre.name = "GENRE"

    # Remove the original GENRE column, then join the one-genre-per-row Series on the index
    cp = cp.drop(columns="GENRE")
    new_df = cp.join(genre)

    # Convert the release date from string to datetime and drop the RELEASE column
    new_df["Date"] = pd.to_datetime(new_df["RELEASE"], format="%Y-%m-%d")
    new_df = new_df.drop(columns="RELEASE")

    # Use the release date as the index
    new_df = new_df.set_index("Date", drop=True)

    return new_df

In [66]:
movie_df_tidy_1 = split_genre(movie_df)

In [67]:
tv_df_tidy_1 = split_genre(tv_df)
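
As an alternative to the split-and-stack function, pandas 0.25 and later provide `DataFrame.explode`, which turns list-valued cells into one row per item. The sketch below is not the approach used in this notebook; `split_genre_explode` and `toy_df` are hypothetical names, and the toy data only mimics the `GENRE` and `RELEASE` columns.

```python
import pandas as pd

# Hypothetical toy data that mimics the GENRE and RELEASE columns
toy_df = pd.DataFrame({
    "MOVIE_NAME": ["Aquaman", "Aladdin"],
    "GENRE": ["Sci-Fi,Adventure", "Comedy"],
    "RELEASE": ["2018-11-26", "2019-05-08"],
})


def split_genre_explode(df):
    """One genre per row via explode, indexed by release date."""
    out = df.copy()
    # Turn each comma-separated string into a list, then give each list item its own row
    out["GENRE"] = out["GENRE"].str.split(",")
    out = out.explode("GENRE")
    # Parse the release date and use it as the index
    out["Date"] = pd.to_datetime(out["RELEASE"], format="%Y-%m-%d")
    return out.drop(columns="RELEASE").set_index("Date")


toy_tidy = split_genre_explode(toy_df)
print(toy_tidy)
```

One difference to keep in mind: here `GENRE` stays in its original column position, whereas the join in `split_genre` appends it as the last column.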

## Ratings Columns Split 

In [68]:
site_user_rating_4movie = movie_df_tidy_1.RATING.str.split("/").str[0]
site_number_user_rated_4movie = movie_df_tidy_1.RATING.str.split("/").str[1].str.split(" ").str[0]


In [69]:
site_user_rating_4tv = tv_df_tidy_1.RATING.str.split("/").str[0]
site_number_user_rated_4tv = tv_df_tidy_1.RATING.str.split("/").str[1].str.split(" ").str[0]
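
Both pieces come out of `str.split` as strings, and the review count still carries a thousands separator. If numeric columns are wanted for the analysis, one possible approach is sketched below; the sample `rating` Series and the names `score` and `count` are hypothetical, and the regular expression assumes every non-missing value follows the `score/count times` layout seen above.

```python
import pandas as pd

# Hypothetical sample in the same "score/count times" layout as the RATING column
rating = pd.Series(["7.1/12,640 times", "6.2/21,830 times", None])

# Capture the score before the slash and the count after it
parts = rating.str.extract(r"^(?P<score>[\d.]+)/(?P<count>[\d,]+)")

# Convert to numbers; the thousands separator has to go before the count can be parsed
score = pd.to_numeric(parts["score"])
count = pd.to_numeric(parts["count"].str.replace(",", "", regex=False))

print(score.tolist())
print(count.tolist())
```

Missing ratings stay missing after the conversion, so the later null check still finds them.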


### Assign New Columns and Drop the Old Ones

In [70]:
tv_df_tidy_2 = tv_df_tidy_1.copy()
movie_df_tidy_2= movie_df_tidy_1.copy()

In [71]:
movie_df_tidy_2['User_Reviews_local'] = site_user_rating_4movie
movie_df_tidy_2['Number_Reviews_local'] = site_number_user_rated_4movie

In [72]:
tv_df_tidy_2['User_Reviews_local'] = site_user_rating_4tv
tv_df_tidy_2['Number_Reviews_local'] = site_number_user_rated_4tv

In [73]:
tv_df_tidy_2.drop('RATING', inplace=True,axis=1)
movie_df_tidy_2.drop('RATING', inplace=True,axis=1)

# Missing Values

In [74]:
print(movie_df_tidy_2.info())

print("**"*20)
print(tv_df_tidy_2.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3790 entries, 2019-04-22 to 2007-02-09
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   MOVIE_NAME            3790 non-null   object 
 1   WATCH_LINK            3790 non-null   object 
 2   DATE_ADDED            3790 non-null   object 
 3   SITE_RANK             3790 non-null   int64  
 4   STARS                 3788 non-null   object 
 5   IMDB                  3788 non-null   float64
 6   DIRECTOR              3788 non-null   object 
 7   COUNTRY               3788 non-null   object 
 8   GENRE                 3788 non-null   object 
 9   User_Reviews_local    3788 non-null   object 
 10  Number_Reviews_local  3788 non-null   object 
dtypes: float64(1), int64(1), object(9)
memory usage: 355.3+ KB
None
****************************************
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 764 entries, 2011-04-17 to 2019-03-28
Data column

It seems only the movies have null values, so let's dive deeper.

In [75]:
movie_df_tidy_2[movie_df_tidy_2.GENRE.isnull()]

| Date | MOVIE_NAME | WATCH_LINK | DATE_ADDED | SITE_RANK | STARS | IMDB | DIRECTOR | COUNTRY | GENRE | User_Reviews_local | Number_Reviews_local |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NaT | Brightburn | https://fmovies.to/film/brightburn.pj38q | 2020-07-14 | 68 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| NaT | The Wolverine | https://fmovies.to/film/the-wolverine.7jm7 | 2020-07-14 | 1062 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


Earlier, to keep the crawl from running too long, we returned NaN for bad requests. We could go through each link individually to fill in the values, but let's drop them for now.

In [76]:
movie_df_tidy_2.dropna(inplace=True,axis=0)
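
Before dropping, it can help to list which rows are affected. The sketch below uses a hypothetical toy frame (`toy`, with one row standing in for a failed request) to show that the rows selected by `isna().any(axis=1)` are the ones `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row stands in for a failed request
toy = pd.DataFrame({
    "MOVIE_NAME": ["Aquaman", "Brightburn"],
    "IMDB": [7.0, np.nan],
    "GENRE": ["Action", np.nan],
})

# Rows with at least one missing value, i.e. the rows that dropna() removes
incomplete = toy[toy.isna().any(axis=1)]
print(incomplete["MOVIE_NAME"].tolist())

# Non-inplace variant, so the original frame stays available for comparison
cleaned = toy.dropna(axis=0)
print(len(toy), "->", len(cleaned))
```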

# Write file for analysis part

Passing `index=False` when writing would drop the date index, so let's not do that.

In [77]:
movie_df_tidy_2.to_csv('../Data/Movie.csv')
tv_df_tidy_2.to_csv('../Data/TV.csv')
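
When the files are read back for the analysis, the date index has to be restored explicitly. A minimal sketch of that round trip follows, assuming the index is named `Date` as in the tidy frames; the toy frame and the in-memory buffer are hypothetical stand-ins for the real data and files.

```python
import io

import pandas as pd

# Hypothetical toy frame with a DatetimeIndex named "Date", like the tidy frames
toy = pd.DataFrame(
    {"MOVIE_NAME": ["Aquaman", "Aladdin"], "GENRE": ["Action", "Comedy"]},
    index=pd.to_datetime(["2018-11-26", "2019-05-08"]).rename("Date"),
)

# Write to an in-memory buffer instead of a file so the sketch has no side effects
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores the index; parse_dates=True turns it back into datetimes
restored = pd.read_csv(buffer, index_col="Date", parse_dates=True)
print(restored.index)
```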