# Data Analysis Example: Superhero Movies

 - Explore the data with `info()`, `describe()`, `head()`
 - How many DC movies? How many Marvel? `value_counts()`
 - Highest-rated IMDB movie? Lowest?
 - Dropping NaN values


#### The `info()` method shows information about the DataFrame: the range index, the number of columns, the column labels and data types, the number of non-null values in each column, and the memory usage.

In [2]:
import pandas as pd

sh = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/master/superhero/superhero-movie-dataset-1978-2012-header.csv")
sh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Year                         49 non-null     int64  
 1   Title                        49 non-null     object 
 2   Comic                        49 non-null     object 
 3   IMDB Score                   49 non-null     float64
 4   RT Score                     49 non-null     int64  
 5   Composite Score              49 non-null     float64
 6   Opening Weekend  Box Office  46 non-null     float64
 7   Avg Ticket Price             49 non-null     float64
 8   Opening Weekend Attendance   46 non-null     float64
 9   US Population That Year      49 non-null     int64  
dtypes: float64(5), int64(3), object(2)
memory usage: 4.0+ KB
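
The non-null counts above show that two columns have 46 of 49 values filled. As a complementary check, missing values can also be counted directly. The sketch below uses a small made-up frame (the `toy` name and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing entries in one column
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Opening Weekend Attendance": [3.1e6, np.nan, 4.2e6, np.nan],
})

# isna() marks each missing cell as True; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same call on `sh` should report the gaps that `info()` implies from its non-null counts.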


#### The `.describe()` method provides summary statistics for the numerical columns in our DataFrame.

In [3]:
sh.describe()

Unnamed: 0,Year,IMDB Score,RT Score,Composite Score,Opening Weekend Box Office,Avg Ticket Price,Opening Weekend Attendance,US Population That Year
count,49.0,49.0,49.0,49.0,46.0,49.0,46.0,49.0
mean,2001.326531,6.212245,53.204082,57.663265,56201260.0,5.963061,8654924.0,284476300.0
std,9.764706,1.530201,29.643001,21.815368,47600470.0,1.667506,6345187.0,27849880.0
min,1978.0,2.7,8.0,19.5,870068.0,2.34,189557.3,222584500.0
25%,1997.0,5.3,26.0,39.5,16228060.0,4.59,3302858.0,267783600.0
50%,2004.0,6.4,59.0,62.5,52659760.0,6.21,7773355.0,293045700.0
75%,2008.0,7.4,79.0,75.0,65557130.0,7.18,11258070.0,304374800.0
max,2012.0,9.1,95.0,91.5,207438700.0,7.93,26191760.0,314056000.0
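
Text columns such as the title are left out of the default summary. A minimal sketch of how they could be included, using a made-up frame (the `toy` name and its values are hypothetical):

```python
import pandas as pd

# Toy frame (hypothetical values) mixing text and numeric columns
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Comic": ["DC", "Marvel", "Marvel", "Marvel"],
    "IMDB Score": [7.3, 6.7, 5.3, 4.9],
})

# By default describe() summarizes only the numeric columns
numeric_summary = toy.describe()

# include="all" adds count / unique / top / freq rows for text columns;
# statistics that do not apply to a column are filled with NaN
full_summary = toy.describe(include="all")
print(full_summary)
```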


#### Let us look at some of the data. Use `.head()` to get the first 5 rows or `.tail()` to get the last 5 rows. To obtain random rows, use the `.sample()` method.

In [4]:
#look at some of the data
sh.head()

Unnamed: 0,Year,Title,Comic,IMDB Score,RT Score,Composite Score,Opening Weekend Box Office,Avg Ticket Price,Opening Weekend Attendance,US Population That Year
0,1978,Superman,DC,7.3,95,84.0,7465343.0,2.34,3190317.521,222584545
1,1980,Superman II,DC,6.7,88,77.5,14100523.0,2.69,5241830.112,227224681
2,1982,Swamp Thing,DC,5.3,60,56.5,,2.94,,231664458
3,1983,Superman III,DC,4.9,24,36.5,13352357.0,3.15,4238843.492,233791994
4,1984,Supergirl,DC,4.2,8,25.0,5738249.0,3.36,1707812.202,235824902


In [6]:
# are they all DC comics? Try a random sample of 10
sh.sample(n=10)

Unnamed: 0,Year,Title,Comic,IMDB Score,RT Score,Composite Score,Opening Weekend Box Office,Avg Ticket Price,Opening Weekend Attendance,US Population That Year
43,2011,Thor,Marvel,7.0,77,73.5,65723338.0,7.93,8287937.0,311591917
25,2005,Batman Begins,DC,8.3,85,84.0,48745440.0,6.41,7604593.0,295753151
11,1995,Batman Forever,DC,5.4,42,48.0,52784433.0,4.35,12134350.0,262803276
39,2010,Iron Man 2,Marvel,7.1,74,72.5,128122480.0,7.89,16238590.0,308745538
45,2012,Marvel's The Avengers,Marvel,8.7,92,89.5,207438708.0,7.92,26191760.0,314055984
35,2008,Iron Man,Marvel,7.9,94,86.5,98618668.0,7.18,13735190.0,304374846
29,2006,X-Men: The Last Stand,Marvel,6.8,57,62.5,102750665.0,6.55,15687120.0,298593212
4,1984,Supergirl,DC,4.2,8,25.0,5738249.0,3.36,1707812.0,235824902
27,2005,Fantastic Four,Marvel,5.7,27,42.0,56061504.0,6.41,8745944.0,295753151
37,2009,Watchmen,DC,7.7,64,70.5,55214334.0,7.5,7361911.0,307006550
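
Because `.sample()` draws rows at random, rerunning the cell above will usually show different movies. If a repeatable draw is wanted, the `random_state` parameter can be fixed. A small sketch on a made-up frame (the `toy` name and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Title": list("ABCDEFGH"), "Year": range(2000, 2008)})

# Fixing random_state makes the "random" rows repeatable between runs
first_draw = toy.sample(n=3, random_state=42)
second_draw = toy.sample(n=3, random_state=42)

# Both draws contain the same rows in the same order
print(first_draw.equals(second_draw))
```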


#### The `comic` column contains nominal (categorical) data: the comics publisher, DC or Marvel. The `.value_counts()` method returns the number of occurrences of each unique value in that column, sorted in descending order. Note that these cells refer to columns by the lowercase names that appear in the later `sh.head()` output.

In [10]:
## Who has more movies in the dataset? DC or Marvel?
sh['comic'].value_counts()

Marvel    29
DC        19
Name: comic, dtype: int64

#### If we set `normalize` to `True`, `value_counts()` returns the relative frequency (proportion) of each unique value in the 'comic' column of the sh DataFrame instead of the raw counts. For example, if a comic appears 3 times in a column with a total of 10 entries, the result for that comic would be 0.3 (i.e., 30%).

In [11]:
## let's see that as a proportion of the total
sh['comic'].value_counts(normalize=True)

Marvel    0.604167
DC        0.395833
Name: comic, dtype: float64
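
The "3 out of 10" example from the text can be checked on a made-up column (the values below are hypothetical, chosen only to match that example). The sketch also shows one way to turn the proportions into percentages:

```python
import pandas as pd

# Hypothetical column: 10 entries, "DC" appears 3 times
comic = pd.Series(["DC"] * 3 + ["Marvel"] * 7, name="comic")

counts = comic.value_counts()                # raw counts
shares = comic.value_counts(normalize=True)  # proportions that sum to 1

# Multiply by 100 and round to read the proportions as percentages
percentages = (shares * 100).round(1)
print(percentages)
```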

In [12]:
## what are the ratios in the last 10 years of data (2003 - 2012)?
sh[sh['year'] > 2002]['comic'].value_counts(normalize=True)

Marvel    0.741935
DC        0.258065
Name: comic, dtype: float64

In [13]:
## what about the first 10 years of data (1978 - 1987)?
sh[sh['year'] < 1988]['comic'].value_counts(normalize=True)

DC        0.833333
Marvel    0.166667
Name: comic, dtype: float64
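
The two cells above filter rows by year before counting. A sketch of what happens in that step, on a made-up frame (the `toy` name and its values are hypothetical): the comparison builds a True/False mask with one entry per row, and only the rows marked True are kept.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "year": [1978, 1986, 2003, 2008, 2012],
    "comic": ["DC", "Marvel", "Marvel", "DC", "Marvel"],
})

# The comparison yields one True/False per row (a boolean mask)
recent = toy["year"] > 2002

# .loc keeps the rows where the mask is True and selects one column
recent_shares = toy.loc[recent, "comic"].value_counts(normalize=True)
print(recent_shares)
```

Using `.loc[mask, column]` is one possible spelling; the chained form `toy[mask]["comic"]` used in the cells above selects the same values.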

In [14]:
sh.head()

Unnamed: 0,year,title,comic,imdb,rt,composite,opening_weeked_bo,avg_ticket_price,opening_weekend_attend,us_pop_that_year
0,1980,Superman II,DC,6.7,88,77.5,14100523.0,2.69,5241830.112,227224681
1,1982,Swamp Thing,DC,5.3,60,56.5,,2.94,,231664458
2,1983,Superman III,DC,4.9,24,36.5,13352357.0,3.15,4238843.492,233791994
3,1984,Supergirl,DC,4.2,8,25.0,5738249.0,3.36,1707812.202,235824902
4,1986,Howard the Duck,Marvel,4.3,16,29.5,5070136.0,3.71,1366613.477,240132887


#### Let us create a new DataFrame `sh2` that is a copy of `sh` with every row containing a NaN (missing) value removed. In other words, `dropna()` filters out all incomplete rows and leaves `sh` itself unchanged.

In [8]:
## skip nulls in analysis
sh2 = sh.dropna()
sh2.head()

Unnamed: 0,Year,Title,Comic,IMDB Score,RT Score,Composite Score,Opening Weekend Box Office,Avg Ticket Price,Opening Weekend Attendance,US Population That Year
0,1978,Superman,DC,7.3,95,84.0,7465343.0,2.34,3190317.521,222584545
1,1980,Superman II,DC,6.7,88,77.5,14100523.0,2.69,5241830.112,227224681
3,1983,Superman III,DC,4.9,24,36.5,13352357.0,3.15,4238843.492,233791994
4,1984,Supergirl,DC,4.2,8,25.0,5738249.0,3.36,1707812.202,235824902
5,1986,Howard the Duck,Marvel,4.3,16,29.5,5070136.0,3.71,1366613.477,240132887
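
Notice that the index above jumps from 1 to 3: the surviving rows keep their original labels. The sketch below reproduces this on a made-up frame (the `toy` name and its values are hypothetical) and also shows the `subset` parameter, which restricts the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one incomplete row
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "imdb": [7.3, 5.3, 4.9],
    "attendance": [3.2e6, np.nan, 4.2e6],
})

# dropna() returns a new frame; the original is left unchanged
complete = toy.dropna()

# Surviving rows keep their original labels, so the index now has a gap
print(list(complete.index))

# subset= drops a row only if one of the listed columns is missing
has_score = toy.dropna(subset=["imdb"])
print(len(has_score))
```

Restricting the check with `subset` may be preferable when an analysis only needs some columns, since fewer rows are discarded.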


#### Let us find the highest value in the 'IMDB Score' column of the `sh2` DataFrame and store it in the variable `best_imdb`.

In [13]:
# Movie with the best IMDB score?

In [12]:
best_imdb = sh2['IMDB Score'].max()
best_imdb

9.1

#### Let us filter the `sh2` DataFrame to return all rows where the 'IMDB Score' equals the maximum score stored in `best_imdb`. This returns the row(s) for the movie(s) with the highest IMDB score; if several movies tied for the top score, all of them would appear. Here only one movie has the score of 9.1.

In [14]:
sh2[ sh2['IMDB Score'] == best_imdb ]

Unnamed: 0,Year,Title,Comic,IMDB Score,RT Score,Composite Score,Opening Weekend Box Office,Avg Ticket Price,Opening Weekend Attendance,US Population That Year
46,2012,The Dark Knight Rises,DC,9.1,86,88.5,160887295.0,7.92,20314052.4,314055984
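
The same question can be asked for the lowest score, and pandas offers shortcuts that avoid the intermediate variable. A sketch on a made-up frame (the `toy` name, titles, and scores are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDB Score": [7.3, 9.1, 2.7, 6.4],
})

# idxmax()/idxmin() return the index label of the first highest/lowest value
best_row = toy.loc[toy["IMDB Score"].idxmax()]
worst_row = toy.loc[toy["IMDB Score"].idxmin()]

# nlargest() returns the top-n rows, sorted by score in descending order
top_two = toy.nlargest(2, "IMDB Score")
print(best_row["Title"], worst_row["Title"])
```

Unlike the equality filter above, `idxmax()` and `idxmin()` return only the first match, so the filter remains the safer choice when ties are possible.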
