## Data analysis
The training set produced by the segmentation step was used for the initial data analysis.

In [None]:
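import json
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import yaml

# When run from /app/scripts, use the relative path to the training set;
# otherwise read the dataset path from the 'analysis' section of params.yaml.
cwd = os.getcwd()

if cwd == "/app/scripts":
    filepath = "../data/segmentation/train.csv"
else:
    with open("params.yaml") as params_file:
        params = yaml.safe_load(params_file)["analysis"]
    filepath = params["datasets"]

df = pd.read_csv(filepath)
df.head().T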


In [None]:
# Drop the index column left over from the CSV export.
df = df.drop("Unnamed: 0", axis=1)
df.columns

### Null values analysis

In [None]:
df.isna().sum()

In [None]:
df.dtypes

In [None]:
from pandas.api.types import is_numeric_dtype

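# Impute column by column: the median for numeric columns, an empty string otherwise.
# Calling df.fillna(...) on the whole frame would fill every column with the
# value computed for the first column visited.
for col in df.columns:
    if is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("")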

df.isna().sum()

### Categorical and numerical variables

In [None]:
df_categorical_features = df.select_dtypes(include='object')
df_categorical_features.describe()

### Correlations


In [None]:

sns.pairplot(df, hue="overall")
plt.show()

In [None]:
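# Restrict the correlation matrix to numeric columns; recent pandas versions
# raise an error if text columns are passed to corr() implicitly.
corr = df.corr(numeric_only=True)
sns.set_style("white")
# Mask the upper triangle so each pair of variables is drawn only once.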
mask = np.triu(np.ones_like(corr, dtype=bool))

f, ax = plt.subplots(figsize=(20, 20))

cmap = sns.diverging_palette(230, 20, as_cmap=True)

sns.heatmap(corr, mask=mask,  cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
df.corr(numeric_only=True)

Basic correlation analysis (pairplot, correlation heatmap and tabular presentation) was performed on the columns with a numerical representation. A clear positive correlation was observed between text length and word count, which is expected and therefore not informative. A positive correlation also occurred between reviews that are verified and reviews that are both verified and voted for. A weak positive correlation is also seen between the length of a review (in characters and in words) and the number of votes cast, which may indicate that longer reviews are more helpful.
A negative correlation occurred between whether a review was verified and the length of the review.
The remaining variables in the numerical summary above did not show strong correlations.
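
To read the strongest relationships off the correlation table without scanning it by eye, the pairs can be ranked by absolute correlation. The following is only a minimal sketch on toy data: the helper name `strongest_correlations` and the column names `text_length`, `word_count` and `vote` are assumptions made for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def strongest_correlations(frame: pd.DataFrame, top: int = 5) -> pd.Series:
    """Return the `top` column pairs with the largest absolute Pearson correlation."""
    # Keep numeric and boolean columns only, then compute the correlation matrix.
    corr = frame.select_dtypes(include=["number", "bool"]).corr()
    cols = corr.columns
    # Indices of the strict upper triangle: each pair once, diagonal excluded.
    rows_idx, cols_idx = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[rows_idx, cols_idx],
        index=[f"{cols[i]} ~ {cols[j]}" for i, j in zip(rows_idx, cols_idx)],
    )
    # Sort by absolute value but report the signed coefficient.
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top)


# Hypothetical toy data, not the review dataset.
toy = pd.DataFrame(
    {
        "text_length": [10, 20, 30, 40, 50],
        "word_count": [2, 4, 6, 8, 10],
        "vote": [0, 1, 0, 2, 1],
    }
)
top_pairs = strongest_correlations(toy, top=2)
print(top_pairs)
```

On the real frame the same helper could be called as `strongest_correlations(df)`; sorting by absolute value keeps strong negative correlations visible alongside positive ones.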

In the pairplot with 'overall' as the hue, none of the features separates the output groups. The distributions of the variables (see the pairplot) show no correlation with the type of opinion; the only apparent pattern arises because significantly more opinions are mapped as positive than as neutral or negative, as shown numerically below.

In [None]:
print("Number of positives: ", df[df['overall'] == 'positive'].shape[0])
print("Number of neutral: ", df[df['overall'] == 'neutral'].shape[0])
print("Number of negatives: ", df[df['overall'] == 'negative'].shape[0])
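
The imbalance is often easier to judge as a share of all reviews than as raw counts. A minimal sketch, assuming the label column is named `overall` as in the cells of this notebook; the toy labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy labels standing in for df["overall"].
labels = pd.Series(
    ["positive"] * 6 + ["neutral"] * 2 + ["negative"] * 2, name="overall"
)

# normalize=True returns fractions of the total instead of absolute counts.
shares = labels.value_counts(normalize=True)
print(shares.round(2))
```

On the real data the equivalent call would be `df["overall"].value_counts(normalize=True)`.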

### Text data analysis


In [None]:
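# Lowercase the review text on a copy so the original frame stays untouched.
df_new = df.copy()
df_new["reviewText"] = df_new["reviewText"].str.lower()
# Count the 30 most frequent whitespace-separated tokens.
df_new["reviewText"].str.split(expand=True).stack().value_counts().head(30)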



In conclusion, sentiment cannot be determined from extracted adjectives alone, because single words without context carry little information. This is illustrated by the appearance of 'good' in all three groups: the word suggests a positive overtone, yet it could come from a sentence such as "I was searching for something good but it wasn't this item".
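
The point about context-free words can be illustrated by checking which tokens occur in every sentiment group. This is a rough sketch on made-up reviews, not the extraction used in the analysis: the helper `words_shared_by_all_groups` and the three example texts are hypothetical, and plain whitespace tokenisation stands in for real adjective extraction.

```python
# Hypothetical (label, text) pairs; not taken from the dataset.
reviews = [
    ("positive", "Really good product, good value"),
    ("neutral", "Good enough, but nothing special"),
    ("negative", "I was searching for something good but it wasn't this item"),
]


def words_shared_by_all_groups(rows):
    """Return the set of lowercase tokens that occur in every label group."""
    vocab = {}
    for label, text in rows:
        # Lowercase, split on whitespace and strip surrounding punctuation.
        tokens = {tok.strip(".,!?\"'") for tok in text.lower().split()}
        vocab.setdefault(label, set()).update(tok for tok in tokens if tok)
    return set.intersection(*vocab.values())


shared = words_shared_by_all_groups(reviews)
print(shared)
```

A word that lands in `shared` cannot distinguish the groups on its own, which is why features that keep some context (for example word n-grams) are usually considered alongside single-word counts.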