# "Hype" is all you need #

## Introduction

The motivation behind the project is to work as a team and bring together everything we have seen so far. In other words:

To be able to design, research, develop and deploy a Data Science idea: designing a Big Data architecture on which to train a model with a conclusion in mind, while being ethical and not breaking any EU laws.

### Description

This is research into what defines the success of a film, and whether that success can be predicted (proportionally) from the hype (expectation) generated around it. The approach should be extensible to series, anime, video games or any other type of multimedia content.

As possible definitions of a film's success, we intend to predict:

- The box-office revenue of a film, based on its initial investment and how well it will be received

- The acceptance/acclaim of a film with respect to the initial "hype"

- The rating on IMDB a week after release (and, by extension, on other platforms such as Rotten Tomatoes or Metacritic)

- The film's success (as previously defined) one week after its release


For this, various data sources will be used, such as Twitter, Reddit, YouTube, IMDB, and any others we discover as the research progresses. One of the central components of the application is sentiment analysis, which would become the main focus of the prediction.

### Objectives

### Product

### Assumptions

#### Chosen Model

##### Why?

##### How does it work?

##### Why not...?

## Environment

### Imports

In [30]:
from google.colab import drive
import os

### Load ENV Variables

In [39]:
COLAB_MOUNT_PATH = "/content/drive/" #@param {type:"string"}

COLAB_UNIT_NAME = "MyDrive" #@param {type:"string"}

BASEPATH = "Colab Notebooks" #@param {type:"string"}

### Load system elements

The `-q` flag keeps the installation progress from flooding the notebook

In [6]:
%pip install tensorflow -q

### Mount the drive

In [40]:
drive.mount(COLAB_MOUNT_PATH)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


### Move to the path

Generate the directory's full path

In [41]:
working_path = os.path.join( COLAB_MOUNT_PATH, COLAB_UNIT_NAME, BASEPATH )
print(f'The working directory has been detected as: {working_path}')

The working directory has been detected as: /content/drive/MyDrive/Colab Notebooks/NLP


Attempt to move to the directory

In [42]:
try:
  os.chdir( working_path )
  print('Moved successfully to the desired directory')
except OSError:
  # Raised when the path is missing, is not a directory or is not accessible
  print('Couldn\'t move to the desired directory')

Moved successfully to the desired directory
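
The join-then-move step can also be tried outside Colab. The sketch below is only illustrative: it uses a temporary directory as a stand-in for the Drive mount so that it runs anywhere, and the helper name `move_to` is hypothetical.

```python
import os
import tempfile


def move_to(path: str) -> bool:
    """Tries to change the working directory, reporting whether it worked."""
    try:
        os.chdir(path)
        return True
    except OSError:  # Missing path, not a directory or no permission
        return False


original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as mount_point:
    # Stand-in for os.path.join(COLAB_MOUNT_PATH, COLAB_UNIT_NAME, BASEPATH)
    working_path = os.path.join(mount_point, "MyDrive", "Colab Notebooks")
    os.makedirs(working_path)

    moved = move_to(working_path)
    missing = move_to(os.path.join(mount_point, "does-not-exist"))

    # Leave the temporary directory before it gets removed
    os.chdir(original_dir)

print(moved, missing)
```

Returning a boolean instead of printing makes it possible for later cells to stop early when the directory is not available.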


Now the current directory is

In [43]:
%pwd

'/content/drive/MyDrive/Colab Notebooks/NLP'

## Initialization

At this point we need to instantiate all the data related to our model

### Imports

In [45]:
import pandas as pd
import numpy as np
from typing import List, Tuple

### Load the dataset

In [None]:
dataframe = pd.read_csv('dataset.csv')

### Check that it's right
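
A minimal sketch of what such a check might look like. The CSV content below is made up and only stands in for `dataset.csv`; the column names are assumptions, not the real schema.

```python
import io

import pandas as pd

# Made-up CSV content standing in for 'dataset.csv'
raw_csv = io.StringIO(
    "title,budget,rating\n"
    "Film A,1000000,7.5\n"
    "Film B,2500000,6.1\n"
)
sample = pd.read_csv(raw_csv)

print(sample.shape)   # Number of rows and columns
print(sample.dtypes)  # Inferred type of each column
print(sample.head())  # First rows, to eyeball the values
```

Looking at the shape and the inferred types first tends to reveal parsing problems (wrong separator, numbers read as text) before any analysis is run.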

## EDA

### Constants

These constants should not be changed lightly. They only serve as default values, which can be overridden by passing a different value where needed, and changing them here may affect many cells

In [2]:
FIGURE_WIDTH = 30 #@param {type:"number"}
FIGURE_HEIGHT = 10 #@param {type:"number"}

WHITEGRID = 'whitegrid' #@param {type:"string"}
WHITE = 'white' #@param {type:"string"}
COLOR_MAP = 'BuGn' #@param {type:"string"}

SUBPLOT_ADJUSTMENT = {
  'left'  : 0.1,
  'bottom': 0.1,
  'right' : 0.9,
  'top'   : 0.9,
  'wspace': 0.4,
  'hspace': 0.4,
}

### Imports

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.api.types import is_numeric_dtype, is_string_dtype

### Helpers

In [None]:
def set_plotting_style(
  style: str = WHITE,
  fig_width: float = FIGURE_WIDTH,
  fig_height: float = FIGURE_HEIGHT,
) -> None:
  """Configures the plotting style"""
  sns.set_theme(style=style)
  plt.figure(figsize=(fig_width, fig_height))

In [None]:
def get_df_with_cols(
  dataframe: pd.DataFrame,
  cols: Tuple[str, ...] = None,
  only_numeric: bool = False,
  only_strings: bool = False,
) -> pd.DataFrame:
  """Gets the dataframe with the given columns"""
  # Work on a copy so the original dataframe is left untouched
  df = dataframe.copy()

  # If no columns are given, get every column
  columns = tuple(cols) if cols else tuple(df.columns)

  def filter_by(col) -> bool:
    """Determines if a col should be stored"""
    if only_numeric:
      return is_numeric_dtype(col)
    elif only_strings:
      return is_string_dtype(col)

    return True

  columns = [ col for col in columns if filter_by(df[col]) ]
  df = df[ columns ]

  return df
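
For comparison, pandas ships a built-in way of filtering columns by type, `DataFrame.select_dtypes`. The sketch below shows it on a made-up frame; the column names are assumptions used only for illustration.

```python
import pandas as pd

# Made-up frame mixing numeric and text columns
sample = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "budget": [1_000_000, 2_500_000],
    "rating": [7.5, 6.1],
})

numeric_only = sample.select_dtypes(include="number")
text_only = sample.select_dtypes(exclude="number")

print(list(numeric_only.columns))
print(list(text_only.columns))
```

The helper above remains useful when the type filter has to be combined with an explicit list of columns.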

### Missing Values

It returns the percentage of missing values **per column**, which gives a quick overall view

In [None]:
def get_na_percentage(
  dataframe: pd.DataFrame,
) -> pd.Series:
  """Returns the percentage of missing values in a dataframe per column"""
  return dataframe.isna().sum() * 100 / len(dataframe)

get_na_percentage(dataframe)

Next, we may want the percentage of missing values across the whole dataset

In [None]:
def get_total_na_percentage(
  dataframe: pd.DataFrame,
) -> float:
  """Returns the total percentage of missing values in a dataframe"""
  return dataframe.isna().sum().sum() * 100 / (len(dataframe) * len(dataframe.columns))

get_total_na_percentage(dataframe)
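
To make both percentages concrete, here is a small worked sketch on a made-up frame. It uses the mean of the boolean mask, which is equivalent to dividing the count of missing values by the number of entries.

```python
import numpy as np
import pandas as pd

# Made-up frame: 1 of 4 budgets and 2 of 4 ratings are missing
sample = pd.DataFrame({
    "budget": [1.0, np.nan, 3.0, 4.0],
    "rating": [np.nan, np.nan, 6.0, 7.0],
})

# Per column: share of missing entries in each column
per_column = sample.isna().mean() * 100

# Whole dataset: share of missing entries over every cell (3 of 8)
total = sample.isna().to_numpy().mean() * 100

print(per_column)
print(total)
```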

### Correlation

The correlation helps us identify which variables contribute to, and help explain, the dependent variable

In [None]:
def get_correlation(
  dataframe: pd.DataFrame,
) -> pd.DataFrame:
  """Gets the DataFrame correlation"""
  return get_df_with_cols(
    dataframe=dataframe,
    only_numeric=True
  ).corr()

correlation = get_correlation(dataframe)
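
As a sanity check of what the correlation matrix reports, the sketch below builds a made-up frame in which one column is an exact multiple of another and a third one moves in the opposite direction. All column names and values are assumptions.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "revenue": [2.0, 4.0, 6.0, 8.0],     # Exactly twice the budget
    "complaints": [4.0, 3.0, 2.0, 1.0],  # Falls as the budget rises
})

# Only numeric columns take part in the (Pearson) correlation
sample_correlation = sample.select_dtypes(include="number").corr()

print(sample_correlation)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none. Correlation alone does not establish that one variable causes the other.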

To easily identify the correlation we create a method to plot it

In [None]:
def plot_correlation(
  dataframe: pd.DataFrame,
  cmap: str = COLOR_MAP,
) -> None:
  """Plots the correlation"""
  set_plotting_style(fig_width=15)

  # Compute the correlation from the given dataframe and honour the cmap argument
  sns.heatmap(
    get_correlation(dataframe),
    fmt='g',
    annot=True,
    cmap=cmap,
  )

### Data distribution

It's important to see if the data is balanced or there may be some adjustments to make. We want our model as unbiased as possible with a decent amount of variance

#### Helpers

Plots each of the given columns on its own subplot, using the provided plotting method

In [None]:
def print_cols_as(
  dataframe: pd.DataFrame,
  method: callable,
  cols: Tuple[str, ...] = None,
  number_of_cols: int = 4,
  style: str = WHITE,
  params: dict = None,
  fig_width: float = FIGURE_WIDTH,
  fig_height: float = FIGURE_HEIGHT,
  subplots_adjustment: dict = SUBPLOT_ADJUSTMENT,
) -> None:
  """Plots each column on its own subplot with the given method"""
  df = get_df_with_cols( dataframe, cols, only_numeric=True )

  # Configure the plotting style (this also creates the figure)
  set_plotting_style( style, fig_width, fig_height )

  # Configure the subplots: as many rows as needed, always as a 2D grid
  columns = df.columns
  n_rows = max(1, -(-len(columns) // number_of_cols))
  axes = plt.gcf().subplots(n_rows, number_of_cols, squeeze=False)

  # Set the spacing between subplots
  plt.subplots_adjust(**subplots_adjustment)

  # The extra params belong to the plotting method, not to the grid
  for index, col in enumerate(columns):
    row, column = divmod(index, number_of_cols)
    method(ax=axes[row, column], x=df[col], **(params or {}))
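
The arithmetic behind laying plots out on a grid can be sketched on its own. The helper names `grid_shape` and `grid_position` are hypothetical; ceiling division is used so that no spare empty row is allocated when the number of plots is an exact multiple of the number of columns.

```python
from typing import Tuple


def grid_shape(n_items: int, n_cols: int) -> Tuple[int, int]:
    """Smallest (rows, cols) grid with room for n_items, at least one row."""
    n_rows = max(1, -(-n_items // n_cols))  # Ceiling division
    return n_rows, n_cols


def grid_position(index: int, n_cols: int) -> Tuple[int, int]:
    """(row, col) of the index-th item when filling the grid row by row."""
    return divmod(index, n_cols)


print(grid_shape(8, 4))     # Two full rows
print(grid_shape(9, 4))     # A third row for the leftover item
print(grid_position(5, 4))  # Sixth item: second row, second column
```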

Plots all the given columns together in a single figure

In [None]:
def plot_overall(
  dataframe: pd.DataFrame,
  method: callable,
  cols: Tuple[str] = None,
  style: str = WHITE,
  params: dict = {},
  fig_width: float = FIGURE_WIDTH,
  fig_height: float = FIGURE_HEIGHT,
) -> None:
  """Plots the dataset"""
  df = get_df_with_cols( dataframe, cols, only_numeric=True )

  # Configure the plotting style
  set_plotting_style( style, fig_width, fig_height )

  # Actually plot
  method(data=df, **params)

#### Boxplot

The total Data Distribution

In [None]:
def boxplot_distribution(
  dataframe: pd.DataFrame,
  cols: Tuple[str] = None,
  number_of_cols: int = 4,
) -> None:
  """Boxplots the dataset's distribution"""
  print_cols_as(
    dataframe=dataframe,
    cols=cols,
    number_of_cols=number_of_cols,
    style=WHITEGRID,
    method=sns.boxplot,
    params={
      'orient':"h",
      'palette':"Set2",
    }
  )

boxplot_distribution( dataframe )

The overall view

In [None]:
def boxplot_overall(
  dataframe: pd.DataFrame,
  cols: Tuple[str] = None,
) -> None:
  """Boxplots the dataset"""
  plot_overall(
    dataframe=dataframe,
    cols=cols,
    style=WHITEGRID,
    method=sns.boxplot,
    params={
      'orient':"v",
      'palette':"Set2",
    }
  )

boxplot_overall( dataframe )

#### Histplot

The total Data Distribution

In [None]:
def histplot_distribution(
  dataframe: pd.DataFrame,
  cols: Tuple[str] = None,
  number_of_cols: int = 4,
) -> None:
  """Histplots the dataset's distribution"""
  print_cols_as(
    dataframe=dataframe,
    cols=cols,
    number_of_cols=number_of_cols,
    style=WHITEGRID,
    method=sns.histplot,
  )

histplot_distribution( dataframe )

The overall view

In [None]:
def histplot_overall(
  dataframe: pd.DataFrame,
  cols: Tuple[str] = None,
) -> None:
  """Histplots the dataset"""
  plot_overall(
    dataframe=dataframe,
    cols=cols,
    style=WHITEGRID,
    method=sns.histplot,
  )

histplot_overall( dataframe )

## Preprocessing

### Backup

Create a copy just in case we need to reload the dataframe again

In [None]:
df = dataframe.copy()

### Imports

In [51]:
from sklearn.impute import SimpleImputer

### Clean the data

Fill, remove, or drop the columns and values as needed

### Map the data

#### Col

We need to convert the data from the column "col" to "" because...

### Regularization

### Normalization

We apply the MinMax Normalization, because...

In [None]:
def normalize(dataframe: pd.DataFrame) -> pd.DataFrame:
  """Normalizes a DataFrame"""
  df = dataframe.copy()

  return df

df = normalize(df)
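
Since `normalize` is still a placeholder, here is one possible sketch of min-max scaling, which maps each value `x` of a column to `(x - min) / (max - min)`. The function name `min_max_normalize`, the handling of constant columns and the sample data are all assumptions rather than the project's final implementation.

```python
import pandas as pd


def min_max_normalize(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Rescales every numeric column to the [0, 1] range."""
    df = dataframe.copy()
    numeric = df.select_dtypes(include="number").columns

    minimum = df[numeric].min()
    value_range = df[numeric].max() - minimum
    # A constant column has a zero range; dividing by 1 maps it to 0
    value_range = value_range.replace(0, 1)

    df[numeric] = (df[numeric] - minimum) / value_range
    return df


# Made-up data: one text column, one varying column, one constant column
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [10.0, 20.0, 30.0],
    "constant": [5.0, 5.0, 5.0],
})
normalized = min_max_normalize(sample)

print(normalized)
```

In a real pipeline the minimum and maximum would normally be computed on the training split only and then reused for the validation and test splits, so that no information leaks from the held-out data.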

## Modelling

Now that we have some quality data, it's time to start modelling

### Imports

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split

### Constants

In [None]:
TEST_SIZE = .15 #@param {'type': 'number'}

VALIDATION_SIZE = .4 #@param {'type': 'number'}

SEED = 42 #@param {'type': 'integer'}

### Set the seed

Only for training, evaluation and research purposes. It should NOT be deployed to production with a fixed seed.

It greatly helps to identify any sort of problem, since the same seed should always give the same outcome.

In [48]:
np.random.seed(SEED) # The answer to all the questions in the universe
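
The effect of fixing the seed can be checked directly. In this sketch, resetting the seed reproduces the same random draws, while continuing without a reset does not.

```python
import numpy as np

SEED = 42

np.random.seed(SEED)
first_run = np.random.rand(3)

np.random.seed(SEED)  # Resetting the seed restarts the same sequence
second_run = np.random.rand(3)

third_run = np.random.rand(3)  # No reset: the sequence simply continues

print(np.array_equal(first_run, second_run))
print(np.array_equal(second_run, third_run))
```

Note that `np.random.seed` only covers NumPy's global generator. Other libraries, such as TensorFlow, keep their own random state and need to be seeded separately.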

### Value assignment

It is time to identify the label and separate the columns

In [None]:
target = 'label'
columns = [ col for col in df.columns if col != target ]

And we assign **X** and **y**

In [None]:
X = df[ columns ]
y = df[ target ]

### Validation split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=SEED)
X_test, X_validation, y_test, y_validation = train_test_split(X_test, y_test, test_size=VALIDATION_SIZE, random_state=SEED)
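
To see how many rows end up in each set, the arithmetic of a two-stage split can be worked out by hand. This sketch assumes a made-up dataset of 100 rows, that the second split takes `VALIDATION_SIZE` of the held-out rows, and that fractional sizes are rounded up, which is how `train_test_split` is understood to treat a fractional `test_size`.

```python
import math

TEST_SIZE = 0.15
VALIDATION_SIZE = 0.4
n_rows = 100  # Made-up dataset size

# First split: carve the hold-out set from the full dataset
n_holdout = math.ceil(TEST_SIZE * n_rows)
n_train = n_rows - n_holdout

# Second split: carve the validation set from the hold-out set
n_validation = math.ceil(VALIDATION_SIZE * n_holdout)
n_test = n_holdout - n_validation

print(n_train, n_test, n_validation)
```

With these values the validation set is 40% of the 15% hold-out, that is, about 6% of the whole dataset.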

### Model Preprocessing

### Model Building

In [50]:
def build_model():
  """Builds the model"""
  model = None

  return model

model = build_model()

### Training

In [None]:
def train_model(model):
  """Trains the model"""
  pass

# train_model(model)

### Storage

It is important to have a model ready to use after it's been trained, instead of going through the whole process of cleaning the data, and retraining it each time we may want to use it.

#### Save

In [None]:
def save_model(model) -> bool:
  """Saves the model"""
  pass

model_was_saved = save_model(model)

#### Load

In [None]:
def load_model():
  model = None

  return model

# model = load_model()
loaded_model = load_model()
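
Until the model type is decided, a save/load round trip can be sketched with the standard `pickle` module. Everything here is an assumption: the dictionary is only a stand-in for a trained model, the `path` parameter is not part of the stubs above, and deep-learning frameworks such as TensorFlow usually provide their own save and load routines, which would be preferable for their models.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object works the same way
model = {"weights": [0.1, 0.2, 0.3], "bias": 0.5}


def save_model(model, path: str) -> bool:
    """Saves the model to disk, reporting whether it worked."""
    try:
        with open(path, "wb") as file:
            pickle.dump(model, file)
        return True
    except (OSError, pickle.PicklingError):
        return False


def load_model(path: str):
    """Loads a previously saved model."""
    with open(path, "rb") as file:
        return pickle.load(file)


# A temporary directory keeps the sketch from leaving files behind
with tempfile.TemporaryDirectory() as directory:
    model_path = os.path.join(directory, "model.pkl")
    model_was_saved = save_model(model, model_path)
    loaded_model = load_model(model_path)

print(model_was_saved, loaded_model == model)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.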

## Evaluation

### Basic metric

In [None]:
def evaluate(model):
  """Evaluates the model"""
  pass

score = evaluate(model)
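
As an illustration of what a basic metric could be, the sketch below computes the mean absolute error (MAE) and the root mean squared error (RMSE). It assumes the task is a regression (for example, predicting a rating); the values are made up and are not results from this project.

```python
import numpy as np

# Made-up ratings: what was observed versus what a model predicted
y_true = np.array([7.0, 6.0, 8.0, 5.0])
y_pred = np.array([6.5, 6.0, 9.0, 5.5])

errors = y_pred - y_true

# MAE: average size of the error, in the same units as the target
mae = np.mean(np.abs(errors))

# RMSE: like MAE, but large errors weigh more
rmse = np.sqrt(np.mean(errors ** 2))

print(mae, rmse)
```

If the task turns out to be a classification instead, metrics such as accuracy, precision and recall would be the natural counterparts.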

### In-detail

### Explanation

This is where any necessary verification of the results and the underlying hypothesis will go, once the model has been trained and evaluated

## Data Analysis

## Data Mining