
# Descriptive Analytics in the Bollywood Dataset.
The data file `bollywood.csv` contains box office collection and social media promotion information about movies released from 2013 to 2015.


In [None]:
# Import Pandas and Plotly libraries
import pandas as pd
import plotly.graph_objects as go 

In [None]:
# Use pd.read_csv to read the bollywood.csv dataset into a DataFrame.
df = pd.read_csv('drive/MyDrive/Colab Notebooks/data/bollywood.csv')

# Let's see the first few records of the DataFrame
df.head(5)

,SlNo,Release Date,MovieName,ReleaseTime,Genre,Budget,BoxOfficeCollection,YoutubeViews,YoutubeLikes,YoutubeDislikes
0,1,18-Apr-14,2 States,LW,Romance,36,104.0,8576361,26622,2527
1,2,4-Jan-13,Table No. 21,N,Thriller,10,12.0,1087320,1129,137
2,3,18-Jul-14,Amit Sahni Ki List,N,Comedy,10,4.0,572336,586,54
3,4,4-Jan-13,Rajdhani Express,N,Drama,7,0.35,42626,86,19
4,5,4-Jul-14,Bobby Jasoos,N,Comedy,18,10.8,3113427,4512,1224


## Structure of the Bollywood DataFrame.

The DataFrame has the following columns:

- SlNo – Serial number of the record
- Release Date – Release date of the movie
- MovieName – Name of the movie
- ReleaseTime – Mentions special time of release. **LW (Long weekend), FS (Festive Season), HS (Holiday Season), N (Normal)**
- Genre – Genre of the film such as **Romance, Thriller, Action, Comedy**, etc.
- Budget – Movie creation budget
- BoxOfficeCollection – Box office collection
- YoutubeViews – Number of views of the YouTube trailers
- YoutubeLikes – Number of likes of the YouTube trailers
- YoutubeDislikes – Number of dislikes of the YouTube trailers

We will use the `info()` method to explore the **metadata** of the dataset.

In [None]:
# Print the metadata information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   SlNo                 149 non-null    int64  
 1   Release Date         149 non-null    object 
 2   MovieName            149 non-null    object 
 3   ReleaseTime          149 non-null    object 
 4   Genre                149 non-null    object 
 5   Budget               149 non-null    int64  
 6   BoxOfficeCollection  149 non-null    float64
 7   YoutubeViews         149 non-null    int64  
 8   YoutubeLikes         149 non-null    int64  
 9   YoutubeDislikes      149 non-null    int64  
dtypes: float64(1), int64(5), object(4)
memory usage: 11.8+ KB


The dataset has 149 records and 10 columns, with no missing values. One column is of floating-point type, five are of integer type, and four are of object type. The DataFrame uses at least 11.8 KB of memory (the `+` in the output means the object columns may take more).
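The same facts can be checked programmatically instead of being read off the `info()` output. The following is a minimal sketch, assuming a small hypothetical `sample` DataFrame that mirrors a few columns of `bollywood.csv`; the names `sample`, `missing_per_column`, and `dtype_counts` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample that mirrors a few columns of bollywood.csv
sample = pd.DataFrame({
    "MovieName": ["2 States", "Table No. 21", "Bobby Jasoos"],
    "Genre": ["Romance", "Thriller", "Comedy"],
    "Budget": [36, 10, 18],
    "BoxOfficeCollection": [104.0, 12.0, 10.8],
})

# Count missing values per column; all zeros means no column has gaps
missing_per_column = sample.isna().sum()

# Count how many columns share each dtype
dtype_counts = sample.dtypes.value_counts()

print(missing_per_column)
print(dtype_counts)
```

Running the same two calls on `df` should report zero missing values for every column if the `info()` summary above holds.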

## Let's start with the descriptive analysis.

### How many movies got released in each genre?

Since `Genre` is a categorical column, we will use the `value_counts()` method, which returns the number of occurrences of each unique value in the column.

In [None]:
# Print how many movies got released in each genre
df.Genre.value_counts()

Comedy       36
 Drama       35
Thriller     26
Romance      25
Action       21
Action        3
Thriller      3
Name: Genre, dtype: int64

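As the output shows, the Comedy genre has the largest number of releases, followed very closely by Drama. Note that Action and Thriller each appear twice and Drama is printed with a leading space, which suggests that some genre labels contain stray whitespace and are being counted as separate categories. If the duplicated labels are combined, Thriller has 29 releases and Action 24, so Action and Romance (25) would be the genres with the fewest releases.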
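Action and Thriller each appear twice in the output above, which most likely means that some labels carry leading or trailing spaces. A minimal sketch of how such labels could be normalized before counting, assuming whitespace is indeed the cause; the `genres` Series below is hypothetical data, not taken from the file.

```python
import pandas as pd

# Hypothetical genre labels; some carry stray spaces, as the real column appears to
genres = pd.Series(
    ["Comedy", " Drama", "Action", "Action ", "Thriller", " Thriller"],
    name="Genre",
)

# Counting the raw labels treats "Action" and "Action " as different genres
raw_counts = genres.value_counts()

# Stripping whitespace first merges the duplicated labels
clean_counts = genres.str.strip().value_counts()

print(raw_counts)
print(clean_counts)
```

On the real DataFrame, the equivalent step would be `df['Genre'] = df['Genre'].str.strip()` followed by `df['Genre'].value_counts()`.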

### How many movies in each genre were released at different release times, such as long weekend, festive season, etc.?

To answer this question we will use the `pd.crosstab()` function, which cross-tabulates the Genre and ReleaseTime columns.

In [None]:
# Find how many movies in each genre got released in different release times
pd.crosstab(df['Genre'], df['ReleaseTime'])

Genre / ReleaseTime,FS,HS,LW,N
Drama,4,6,1,24
Action,3,3,3,12
Action,0,0,0,3
Comedy,3,5,5,23
Romance,3,3,4,15
Thriller,4,1,1,20
Thriller,0,0,1,2


As the output shows, every genre releases most of its movies in the **normal season (N)**. **Drama** has the most releases in both the **Holiday Season (HS)** and the **normal season (N)**.
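The cross-tabulation can also report row and column totals, or the share of each release time within a genre, which makes comparisons between genres of different sizes easier. The sketch below uses a small hypothetical `sample` DataFrame and strips whitespace from the genre labels first, on the assumption that the duplicated Action and Thriller rows above come from stray spaces.

```python
import pandas as pd

# Hypothetical data with the same two columns used in the cross-tabulation
sample = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Comedy", "Comedy", "Action", "Action "],
    "ReleaseTime": ["N", "HS", "N", "LW", "N", "FS"],
})

# Normalize labels so "Action" and "Action " fall into one row
genre = sample["Genre"].str.strip()

# Counts with an extra "Total" row and column
table = pd.crosstab(genre, sample["ReleaseTime"], margins=True, margins_name="Total")

# Share of each release time within a genre (each row sums to 1)
shares = pd.crosstab(genre, sample["ReleaseTime"], normalize="index")

print(table)
print(shares)
```

The `margins` and `normalize` parameters belong to `pd.crosstab` itself; only the sample data and variable names are invented for illustration.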