
# Project: Investigate a Dataset - tmdb-movies

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

This analysis uses the tmdb-movies dataset, which contains 10,000 films. We will run several analyses to answer the questions below.

### Question(s) for Analysis
 1. How many films have a rating of 7 or more, and how many have a rating below 7? Do budget, runtime, or genre differ between these two groups?

 2. Which are better, old or new movies?

In [None]:
#import statements for all of the packages that you
#   plan to use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
movie=pd.read_csv('tmdb-movies.csv')

<a id='wrangling'></a>
## Data Wrangling



### General Properties

In [None]:
movie.head(1)

In [None]:
movie.shape

In [None]:
movie.describe()


### Data Cleaning
 

In [None]:
#Detecting duplicates

movie.duplicated().sum()

In [None]:
#delete duplicates

movie.drop_duplicates(inplace=True)


In [None]:
#Delete some columns that will not help
movie.drop(['homepage','tagline','keywords','release_date','production_companies','budget_adj','revenue_adj'],
           axis=1, inplace=True)

movie.head(1)

In [None]:
#Delete rows with a zero runtime, revenue, or budget (a zero here stands for a missing value)

# Build one boolean mask so that a row with several zero columns is removed only once;
# dropping the three index lists one after another fails when the same row appears in more than one list
valid_rows=(movie['runtime']!=0)&(movie['revenue']!=0)&(movie['budget']!=0)
movie=movie[valid_rows].copy()


In [None]:
movie.describe()

In [None]:
#Fill in the missing values with the mean of each column
for col in ['vote_average','budget','revenue']:
    movie[col] = movie[col].fillna(movie[col].mean())

In [None]:
movie.describe()

### Data cleaning phase summary
* Deleted duplicate rows
* Deleted columns that are not needed to answer our questions
* Deleted rows with a zero value in the runtime, revenue, or budget columns
* Filled in missing values in the budget, revenue, and vote_average columns with the column mean
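
As a rough illustration of the cleaning steps summarized above, the sketch below applies them to a tiny made-up table. The `sample` data and the `clean_movies` helper are hypothetical and are not part of the notebook; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for tmdb-movies.csv (all values are made up)
sample = pd.DataFrame({
    'original_title': ['A', 'A', 'B', 'C', 'D'],
    'runtime': [100, 100, 0, 95, 120],
    'budget': [1000, 1000, 500, 0, 2000],
    'revenue': [3000, 3000, 800, 400, None],
    'vote_average': [7.1, 7.1, 5.0, 6.2, None],
})

def clean_movies(df):
    """Apply the cleaning steps from the summary to a copy of df."""
    cleaned = df.drop_duplicates()
    # Keep only rows where runtime, revenue and budget are all non-zero
    nonzero = (cleaned[['runtime', 'revenue', 'budget']] != 0).all(axis=1)
    cleaned = cleaned[nonzero].copy()
    # Fill the remaining gaps with the column mean
    for col in ['vote_average', 'budget', 'revenue']:
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
    return cleaned

cleaned_sample = clean_movies(sample)
print(cleaned_sample)
```

Filtering with a single boolean mask removes each offending row once, regardless of how many of its columns are zero.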

In [None]:
movie.info()

<a id='eda'></a>
## Exploratory Data Analysis



In [None]:
#data review
movie.hist();

In [None]:
#Mask for films with a rating greater than or equal to 7
high_score=movie.vote_average>=7
#Mask for films with a rating less than 7
low_score=movie.vote_average<7
#Create a new boolean column: True when the rating is 7 or more
movie['score']=high_score

### How many films have a rating of 7 or more, and how many have a rating below 7? Do budget, runtime, or genre differ between the two groups?


Share of films with a rating of 7 or more compared with films with a rating below 7


In [None]:
#Comparison of the number of films rated 7 or more with the number rated below 7.
def Percentage_of_films(movie,col_name):
    # Count the films in each group of col_name and show each group's share as a pie chart
    counts=movie.groupby(col_name)['vote_average'].count()
    plt.figure(figsize=[12,8])
    counts.plot(kind='pie',colors=['red','green'],autopct='%1.1f%%')
    plt.legend()
    plt.title('Share of films rated 7 or more (True) versus below 7 (False)')
    plt.ylabel('')
    # Return the counts so that the notebook also displays them
    return counts
Percentage_of_films(movie,'score')

* It turns out that far fewer films have a rating of 7 or more than a rating below 7.
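
The share of each group can also be computed directly, without reading it off a chart. The following is a minimal sketch on made-up ratings; the `ratings` values are hypothetical and only the 7-point threshold comes from the analysis.

```python
import pandas as pd

# Hypothetical ratings, not taken from the real dataset
ratings = pd.Series([7.5, 6.0, 5.5, 8.1, 6.9, 4.2, 7.0, 6.3, 5.0, 6.5],
                    name='vote_average')

# True for films rated 7 or more, False otherwise
is_high = ratings >= 7

# The mean of a boolean Series is the fraction of True values
high_share = is_high.mean() * 100
low_share = 100 - high_share

print(f"rated 7 or more: {high_share:.1f}%")
print(f"rated below 7:   {low_share:.1f}%")
```

On the real data the same two lines would be applied to `movie['vote_average']`.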


### The relationship between a film's budget and obtaining a high rating


In [None]:
#Average budget of films rated 7 or more (True) versus films rated below 7 (False)
movie.groupby('score')['budget'].mean()

In [None]:
def plot_mean_budget(movie,col_name):
    # Bar chart of the average budget for each group of col_name
    colors=['gray','green']
    plt.figure(figsize=[18,6])
    movie.groupby(col_name)['budget'].mean().plot(kind='bar',title="Average budget of films rated 7 or more (True) versus below 7 (False)",color=colors,alpha=.7)
    plt.xlabel('rating of 7 or more')
    plt.ylabel('average budget')
plot_mean_budget(movie,'score')

* Films with a rating of 7 or more have a higher average budget than films rated below 7.


### The relationship between a film's runtime and obtaining a high rating


In [None]:
#Average runtime of films rated 7 or more (True) versus films rated below 7 (False)
plt.figure(figsize=[18,6])
movie.groupby('score')['runtime'].mean().plot(kind='bar',title="Average runtime of films rated 7 or more (True) versus below 7 (False)",color=['red','blue'],alpha=.5)
movie.groupby('score')['runtime'].mean()

* Films with a rating of 7 or more have a longer average runtime than films rated below 7.
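
Because a mean can be pulled up by a few very large values, it may help to look at the median next to it. The sketch below compares both statistics for budget and runtime on a small made-up table; the `sample` values and the `score_label` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical films (budget in millions, runtime in minutes)
sample = pd.DataFrame({
    'vote_average': [7.5, 8.0, 6.0, 5.0, 6.5],
    'budget': [50, 70, 20, 40, 30],
    'runtime': [130, 150, 95, 100, 105],
})

# Label each film by rating group, using the same threshold of 7
sample['score_label'] = np.where(sample['vote_average'] >= 7, '7 or more', 'below 7')

# Mean and median of budget and runtime for each group
summary = sample.groupby('score_label')[['budget', 'runtime']].agg(['mean', 'median'])
print(summary)
```

If the mean and the median of a group differ widely, the group difference is likely driven by a few extreme films rather than by the typical film.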


In [None]:
# Most common genre combination among films with a rating of 7 or more
movie[high_score]['genres'].mode()

In [None]:
# Most common genre combination among films with a rating below 7
movie[low_score]['genres'].mode()
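
The `genres` column can hold several genres in one string, so `mode()` returns the most frequent combination rather than the most frequent single genre. The sketch below shows one way to count single genres instead. It assumes the genres are separated by `|`, and the `sample` rows are made up.

```python
import pandas as pd

# Hypothetical rows; genres are assumed to be separated by '|'
sample = pd.DataFrame({
    'genres': ['Drama|Romance', 'Comedy', 'Drama|Thriller', 'Comedy|Family', 'Drama'],
    'vote_average': [7.8, 5.9, 7.2, 6.1, 7.0],
})

# Split each string into a list and give every genre its own row
genre_rows = sample.assign(genre=sample['genres'].str.split('|')).explode('genre')

high = genre_rows[genre_rows['vote_average'] >= 7]
low = genre_rows[genre_rows['vote_average'] < 7]

# Most frequent single genre in each rating group
top_high = high['genre'].value_counts().idxmax()
top_low = low['genre'].value_counts().idxmax()
print(top_high, top_low)
```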

### Which are better, old or new movies? (Old films are those released before 1990.)

Which films have the higher average rating: old films or modern films?

In [None]:
#Do newer movies get a higher rating than older ones?
plt.figure(figsize=[18,6])
movie.groupby('release_year')['vote_average'].mean().plot(kind='bar')

plt.title('Average vote_average by release_year')
plt.xlabel('release_year')
plt.ylabel('vote_average')

* This chart suggests that older films received a relatively higher average rating.
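
The old versus new comparison can be reduced to two numbers by grouping the films into eras. The sketch below uses the 1990 cutoff stated in the question; the `sample` values and the `era` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical films, not taken from the real dataset
sample = pd.DataFrame({
    'release_year': [1972, 1985, 1989, 1994, 2008, 2014],
    'vote_average': [8.0, 7.0, 6.0, 6.5, 6.0, 5.5],
})

# Films released before 1990 count as old, all others as new
sample['era'] = np.where(sample['release_year'] < 1990, 'old', 'new')

# Average rating per era
era_mean = sample.groupby('era')['vote_average'].mean()
print(era_mean)
```

A single cutoff hides the year-to-year variation visible in the bar chart, so the two views are best read together.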


### Do movies with high ratings necessarily get high popularity?


In [None]:
#Are movies with a higher rating necessarily more popular?
plt.figure(figsize=[18,6])
movie.groupby('vote_average')['popularity'].mean().plot()

plt.title('Average popularity by vote_average')
plt.xlabel('vote average')
plt.ylabel('average popularity')

* On average, films with a higher rating tend to be more popular.
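
A correlation coefficient gives a single number for how closely rating and popularity move together. The sketch below computes the Pearson correlation on made-up values; the `sample` data is hypothetical, and the coefficient only measures a linear association, not cause and effect.

```python
import pandas as pd

# Hypothetical films, not taken from the real dataset
sample = pd.DataFrame({
    'vote_average': [5.0, 6.0, 6.5, 7.0, 8.0],
    'popularity': [0.4, 0.9, 0.7, 1.5, 2.8],
})

# Pearson correlation between rating and popularity (ranges from -1 to 1)
corr = sample['vote_average'].corr(sample['popularity'])
print(round(corr, 2))
```

A value close to 1 means that popularity rises almost linearly with rating, while a value near 0 means there is no linear relationship.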


### Are old or new movies more popular?

In [None]:
#Which movies are more popular, new or old?

plt.figure(figsize=[18,6])
movie.groupby('release_year')['popularity'].mean().plot(kind='bar', title='Average popularity by release_year')

plt.xlabel('release_year')
plt.ylabel('popularity')

* Average popularity is clearly higher for modern films.


<a id='conclusions'></a>
## Conclusions
* Films with a rating of 7 or more make up approximately 20% of the films.

* A higher budget appears to have a slight effect: films with a rating of 7 or more have a slightly higher average budget than films with lower ratings.

* Runtime also appears to play a role: films with a rating of 7 or more have a longer average runtime than films with a rating below 7.

* Among the films with the highest ratings, drama was the most common genre, while comedy was the most common genre among the films with the lowest ratings.

* It turns out that older films received higher average ratings.

* The films with the highest ratings were not always the most popular: some lower-rated films were more popular than some films with higher ratings.

* Recent movies are becoming more and more popular.


* As for which films are better, I think modern films are better: they are more popular than old films and still receive fairly high ratings, even though older films have a higher average rating.





### Limitations
* Many values are missing
* Some values are implausible, such as a budget of zero or a runtime of zero
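
The size of both limitations can be measured before any cleaning. The sketch below counts missing values and zeros per column on a made-up table; the `sample` data and the `report` table are hypothetical.

```python
import pandas as pd

# Hypothetical rows with missing values (None) and implausible zeros
sample = pd.DataFrame({
    'budget': [0, 100, None, 0],
    'runtime': [90, 0, 120, 110],
    'revenue': [0, 0, 500, None],
})

# One row per column: how many values are missing and how many are zero
report = pd.DataFrame({
    'missing': sample.isna().sum(),
    'zeros': (sample == 0).sum(),
})
print(report)
```

Running the same report on the full dataset would show how many films are lost when rows with zeros are dropped.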


In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])