## Data Join

This notebook is part of the Sentimetrix project. It joins the cleaned datasets into a single dataset for training the model.

### Import Libraries

The first step is to import pandas, the only library this notebook needs.

In [None]:
import pandas as pd

### File Paths

File paths for the cleaned datasets are defined as a list. You can modify the paths to match your directory structure.

In [3]:
# File paths to cleaned data files
file_paths = [
    '../data/cleaned/portuguese_data.csv',
    '../data/cleaned/financial_data.csv',
    '../data/cleaned/twitter_data.csv',
    '../data/cleaned/movies_data.csv'
]
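
Because the paths are relative, a typo or a different working directory only shows up later as a read error. The check below is a minimal sketch of how the list could be validated first; `find_missing` and the example path are hypothetical names introduced for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_missing(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical path used only for this demonstration;
# in the notebook you would pass file_paths instead.
example_paths = ['no_such_dir/example_data.csv']

missing = find_missing(example_paths)
if missing:
    print('Missing files:', missing)
```

Running `find_missing(file_paths)` before the read step would list every path that needs correcting in one pass.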

### Read Datasets

The datasets are read from the CSV files with a list comprehension. Each file is loaded into its own DataFrame, and the DataFrames are stored in the _datasets_ list.

In [None]:
# Read the data files into a list of DataFrames
datasets = [pd.read_csv(file) for file in file_paths]
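
The later steps rely on every dataset having 'text' and 'sentiment' columns. The sketch below shows one way to confirm that right after reading; `missing_columns`, `REQUIRED_COLUMNS`, and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd

# Columns the later concatenation step selects
REQUIRED_COLUMNS = {'text', 'sentiment'}


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return a sorted list of required columns that `df` lacks."""
    return sorted(required - set(df.columns))


# Toy DataFrame with a mislabeled column, for demonstration only
toy = pd.DataFrame({'text': ['good service'], 'label': ['positive']})

print(missing_columns(toy))  # ['sentiment']
```

In the notebook, the same function could be applied to each entry of _datasets_ to spot a file whose columns were named differently during cleaning.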

### Concatenate Datasets

The datasets are concatenated along the rows using _pd.concat()_ with _ignore_index=True_, so the combined DataFrame gets a fresh 0-based index. Only the 'text' and 'sentiment' columns are kept in the final DataFrame.

In [None]:
# Concatenate datasets into one dataframe
main_df = pd.concat(datasets, ignore_index=True)[['text', 'sentiment']]

# Show the first 5 rows
main_df.head()
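
To make the effect of the concatenation concrete, here is a small illustration on toy data (the two DataFrames and their contents are invented for this example). It shows that extra columns are dropped by the selection and that the index is renumbered.

```python
import pandas as pd

# Two toy datasets; the first carries an extra 'source' column
movies = pd.DataFrame({
    'text': ['great film', 'boring plot'],
    'sentiment': ['positive', 'negative'],
    'source': ['movies', 'movies'],
})
finance = pd.DataFrame({
    'text': ['stocks fell sharply'],
    'sentiment': ['negative'],
})

# Stack the rows, renumber the index, keep only the shared columns
combined = pd.concat([movies, finance], ignore_index=True)[['text', 'sentiment']]

print(combined.index.tolist())    # [0, 1, 2]
print(combined.columns.tolist())  # ['text', 'sentiment']
```

Without `ignore_index=True`, the result would keep each dataset's original row labels (0, 1, 0 here), which produces duplicate index values.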

### Save Concatenated Data

The concatenated data is saved as both a CSV file (without the index column) and a Pickle file.

In [None]:
# Save concatenated data as CSV
csv_path = '../data/cleaned/main_data.csv'
main_df.to_csv(csv_path, index=False)

# Save concatenated data as Pickle
pickle_path = '../data/cleaned/main_data.pkl'
main_df.to_pickle(pickle_path)
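
A quick way to confirm that both files were written correctly is to read them back and compare. The following is a sketch on toy data written to a temporary directory, so it does not touch the project files; the DataFrame contents and file names are made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for main_df
toy_df = pd.DataFrame({
    'text': ['good', 'bad'],
    'sentiment': ['positive', 'negative'],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / 'toy.csv'
    pkl_file = Path(tmp) / 'toy.pkl'

    # Write both formats, mirroring the save cell above
    toy_df.to_csv(csv_file, index=False)
    toy_df.to_pickle(pkl_file)

    # Read both back while the temporary files still exist
    from_csv = pd.read_csv(csv_file)
    from_pkl = pd.read_pickle(pkl_file)

# The pickle round trip should reproduce the DataFrame exactly
pkl_ok = from_pkl.equals(toy_df)
# For CSV, compare shape and values, since dtypes can change on reload
csv_ok = from_csv.shape == toy_df.shape and from_csv['text'].tolist() == ['good', 'bad']
print(pkl_ok, csv_ok)
```

The two formats serve different purposes: CSV is portable and human-readable but does not store dtypes, while Pickle preserves the DataFrame as-is and should only be loaded from trusted sources.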

### Display Concatenated Dataframe

The concatenated dataframe, _main_df_, is displayed below for visual inspection.

In [None]:
main_df