## Sentiment Dataset

In [1]:
import sklearn as skl
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

*Load the CSV*

In [5]:
'''
Load CSV
'''
df = pd.read_csv('sentiment_analysis.csv')

**1. Understand the data**

* .columns
* .head()
* .tail()
* .shape
* .dtypes
* .info()
* .describe()
* .isna()

In [16]:
'''
Inspect data
'''
print(df.columns)


Index(['CustomerKey', 'WebActivity', 'Sentiment Analysis', 'SentimentRating',
       'MaritalStatus', 'Gender', 'EstimatedYearlyIncome', 'NumberOfContracts',
       'Age', 'Target', 'Available401K', 'CustomerValueSegment', 'ChurnScore',
       'CallActivity', 'Products', 'birthday'],
      dtype='object')


In [15]:
df.tail()

|  | CustomerKey | WebActivity | Sentiment Analysis | SentimentRating | MaritalStatus | Gender | EstimatedYearlyIncome | NumberOfContracts | Age | Target | Available401K | CustomerValueSegment | ChurnScore | CallActivity | Products | birthday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15162 | 13748 | 0 | Positive | 4 | M | M | 80000 | 0 | 52 | 0 | 1 | 2 | 0.1 | 3 | fund manager+ | 1963-12-10 |
| 15163 | 12104 | 1 | Positive | 4 | M | F | 70000 | 2 | 61 | 1 | 1 | 2 | 1.0 | 4 | p+b investment | 1954-11-05 |
| 15164 | 12104 | 1 | Positive | 4 | M | F | 70000 | 2 | 61 | 1 | 1 | 2 | 1.0 | 4 | p+b investment | 1954-12-19 |
| 15165 | 13120 | 5 | Very Positive | 5 | M | M | 80000 | 4 | 45 | 0 | 1 | 1 | 0.0 | 4 | private investment | 1971-03-18 |
| 15166 | 13120 | 5 | Very Positive | 5 | M | M | 80000 | 4 | 45 | 0 | 1 | 1 | 0.0 | 4 | private investment | 1970-06-05 |


In [14]:
df.shape

(15167, 16)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15167 entries, 0 to 15166
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CustomerKey            15167 non-null  int64  
 1   WebActivity            15167 non-null  int64  
 2   Sentiment Analysis     15167 non-null  object 
 3   SentimentRating        15167 non-null  int64  
 4   MaritalStatus          15167 non-null  object 
 5   Gender                 15167 non-null  object 
 6   EstimatedYearlyIncome  15167 non-null  int64  
 7   NumberOfContracts      15167 non-null  int64  
 8   Age                    15167 non-null  int64  
 9   Target                 15167 non-null  int64  
 10  Available401K          15167 non-null  int64  
 11  CustomerValueSegment   15167 non-null  int64  
 12  ChurnScore             15167 non-null  float64
 13  CallActivity           15167 non-null  int64  
 14  Products               15167 non-null  object 
 15  bi

In [12]:
df.dtypes


CustomerKey                int64
WebActivity                int64
Sentiment Analysis        object
SentimentRating            int64
MaritalStatus             object
Gender                    object
EstimatedYearlyIncome      int64
NumberOfContracts          int64
Age                        int64
Target                     int64
Available401K              int64
CustomerValueSegment       int64
ChurnScore               float64
CallActivity               int64
Products                  object
birthday                  object
dtype: object

In [17]:
df.describe()

|  | CustomerKey | WebActivity | SentimentRating | EstimatedYearlyIncome | NumberOfContracts | Age | Target | Available401K | CustomerValueSegment | ChurnScore | CallActivity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 | 15167.0 |
| mean | 17559.847102 | 0.999473 | 1.850926 | 57718.07213 | 1.465484 | 48.203402 | 0.486781 | 0.69638 | 2.097251 | 0.268893 | 3.236896 |
| std | 5576.039383 | 1.519967 | 1.619925 | 32091.910319 | 1.144962 | 11.300184 | 0.499842 | 0.459836 | 0.688901 | 0.332298 | 1.26236 |
| min | 11000.0 | 0.0 | 0.0 | 10000.0 | 0.0 | 29.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 25% | 12256.5 | 0.0 | 0.0 | 30000.0 | 1.0 | 40.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 |
| 50% | 14967.0 | 0.0 | 2.0 | 60000.0 | 1.0 | 46.0 | 0.0 | 1.0 | 2.0 | 0.1 | 3.0 |
| 75% | 23045.5 | 2.0 | 3.0 | 70000.0 | 2.0 | 56.0 | 1.0 | 1.0 | 3.0 | 0.5 | 4.0 |
| max | 27336.0 | 5.0 | 5.0 | 170000.0 | 4.0 | 100.0 | 1.0 | 1.0 | 3.0 | 1.0 | 5.0 |


In [19]:
'''
Missing data
'''
print(df.isna().sum())

CustomerKey              0
WebActivity              0
Sentiment Analysis       0
SentimentRating          0
MaritalStatus            0
Gender                   0
EstimatedYearlyIncome    0
NumberOfContracts        0
Age                      0
Target                   0
Available401K            0
CustomerValueSegment     0
ChurnScore               0
CallActivity             0
Products                 0
birthday                 0
dtype: int64


In [20]:
df.isna().any()

CustomerKey              False
WebActivity              False
Sentiment Analysis       False
SentimentRating          False
MaritalStatus            False
Gender                   False
EstimatedYearlyIncome    False
NumberOfContracts        False
Age                      False
Target                   False
Available401K            False
CustomerValueSegment     False
ChurnScore               False
CallActivity             False
Products                 False
birthday                 False
dtype: bool

**Find the goal**

*Goal:* predict `Target` from the customer's properties. `Target` is a numeric categorical variable; per the `df.describe()` output above, it takes the values 0 and 1.

*Modeling:* use the customer's features to predict `Target`.

**2. Data preparation and transformation**

- drop useless columns
- rename columns 
- handle missing values
- handle duplication
- create new features

Since all the variables appear to be physical-chemical measures, any of them could help separate the types of wine, so there is no reason to remove columns.
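The preparation steps listed above can be sketched on a small hypothetical frame. The column names are borrowed from the `df.columns` output, but the rows, the `RowId` helper column, the median fill, the `"Unknown"` label, and the derived `BirthYear` feature are all assumptions made for illustration, not choices made by this notebook.

```python
import pandas as pd

# Hypothetical toy rows; only three of the column names come from the dataset above.
toy = pd.DataFrame({
    "RowId": [1, 2, 3, 4],  # assumed helper column, not present in the real data
    "Sentiment Analysis": ["Positive", "Positive", None, "Negative"],
    "EstimatedYearlyIncome": [80000, 80000, 70000, None],
    "birthday": ["1963-12-10", "1963-12-10", "1954-11-05", "1971-03-18"],
})

prepared = (
    toy.drop(columns=["RowId"])                                       # drop a column with no analytic value
       .rename(columns={"Sentiment Analysis": "SentimentAnalysis"})   # remove the space from the name
       .drop_duplicates()                                             # remove rows identical in every column
       .reset_index(drop=True)
)

# Handle missing values: median for the numeric column, a placeholder label for the text column.
prepared["EstimatedYearlyIncome"] = prepared["EstimatedYearlyIncome"].fillna(
    prepared["EstimatedYearlyIncome"].median()
)
prepared["SentimentAnalysis"] = prepared["SentimentAnalysis"].fillna("Unknown")

# Create a new feature from an existing column.
prepared["BirthYear"] = pd.to_datetime(prepared["birthday"]).dt.year

print(prepared)
```

Which fill strategy is appropriate depends on the variable; the median is used here only because it is robust to extreme values.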

In [21]:
'''
Drop Duplicates
'''
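print(df.duplicated().sum())
df.drop_duplicates(inplace=True)  # modifies df in place and returns None, so there is nothing to print
df.info()  # info() prints its report itself and returns None

16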
<class 'pandas.core.frame.DataFrame'>
Index: 15151 entries, 0 to 15166
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CustomerKey            15151 non-null  int64  
 1   WebActivity            15151 non-null  int64  
 2   Sentiment Analysis     15151 non-null  object 
 3   SentimentRating        15151 non-null  int64  
 4   MaritalStatus          15151 non-null  object 
 5   Gender                 15151 non-null  object 
 6   EstimatedYearlyIncome  15151 non-null  int64  
 7   NumberOfContracts      15151 non-null  int64  
 8   Age                    15151 non-null  int64  
 9   Target                 15151 non-null  int64  
 10  Available401K          15151 non-null  int64  
 11  CustomerValueSegment   15151 non-null  int64  
 12  ChurnScore             15151 non-null  float64
 13  CallActivity           15151 non-null  int64  
 14  Products               15151 non-null  object 
 15 
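`drop_duplicates()` only removes rows that match in every column. In the `df.tail()` output above, rows 15163 and 15164 agree on everything except `birthday`, so they are not counted as duplicates. A minimal sketch of checking a subset of columns instead, on hypothetical rows modeled on that output (whether such near-duplicates should be dropped is a judgment call this notebook does not make):

```python
import pandas as pd

# Hypothetical rows modeled on the tail() output: same customer, different birthday.
toy = pd.DataFrame({
    "CustomerKey": [12104, 12104, 13120],
    "Age": [61, 61, 45],
    "birthday": ["1954-11-05", "1954-12-19", "1971-03-18"],
})

exact_dupes = toy.duplicated().sum()                              # compares all columns
key_dupes = toy.duplicated(subset=["CustomerKey", "Age"]).sum()   # ignores birthday

print(exact_dupes, key_dupes)
```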

In [None]:
'''
Rename complicated columns' names
'''  
...

In [None]:
'''
Remove values (Ash smaller than 2, Alcalinity bigger than 15)
'''
...

**3. Univariate analysis**

Go through each relevant variable and collect basic information such as:

- .hist()
- .value_counts()
- .skew()
- .kurt()
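As a rough sketch of these calls, the snippet below applies them to a small hypothetical frame. The column names come from the dataset above, but the values are made up, so the printed numbers say nothing about the real data. `.hist()` is left out only to keep the sketch free of plotting dependencies.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset above.
toy = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Age": [29, 40, 46, 56, 100],
})

# Categorical variable: frequency of each category.
gender_counts = toy["Gender"].value_counts()

# Numeric variable: shape of the distribution.
age_skew = toy["Age"].skew()  # positive when the right tail is longer
age_kurt = toy["Age"].kurt()  # excess kurtosis, close to 0 for a normal distribution

print(gender_counts)
print(f"skew={age_skew:.2f}, kurt={age_kurt:.2f}")
```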

In [None]:
'''
Categorical variables
'''
...

In [None]:
...

In [None]:
'''
Numeric variables
'''
...

In [None]:
...

The distribution does not follow a normal curve and shows spikes.

In [None]:
...

In [None]:
...

Both the kurtosis and the skewness (asymmetry) are greater than 1.

*Summarize the dataset*

- Variable: the name of the variable.
- Type: the type or format of the variable, such as categorical, numeric, or Boolean.
- Context: useful information to understand the semantic space of the variable. In the case of our dataset, the context is always the chemical-physical one.
- Expectation: how relevant is this variable to our task? We can use the scale "High", "Medium", "Low".
- Comments: any remarks we have on the variable.
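One possible way to start such a summary is to generate the skeleton from the frame and fill in the judgment columns by hand. This is a sketch on hypothetical rows; using the pandas dtype as the "Type" is an assumption, since a column stored as `int64` may still be categorical in meaning.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset above.
toy = pd.DataFrame({
    "Gender": ["M", "F"],
    "Age": [52, 61],
    "ChurnScore": [0.1, 1.0],
})

summary = pd.DataFrame({
    "Variable": toy.columns,
    "Type": [str(t) for t in toy.dtypes],  # dtype as reported by pandas
    "Context": "",       # to be filled in by hand
    "Expectation": "",   # "High", "Medium" or "Low"
    "Comments": "",
})

print(summary)
```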

**4. Multivariate analysis**

- grouping
- bins
- statistical dispersion: histograms, box plots, scatter plots, pair plots, correlation matrices

> scatterplots: plot 2 variables against each other and understand how they move together

> pairplots: plot all variables against each other and understand how they move together
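Two of the items above, grouping and the correlation matrix, can be sketched without any plotting. The correlation matrix is the numeric counterpart of a pair plot: one number per pair of variables instead of one panel. The values below are hypothetical and deliberately simple, so the result says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset above.
toy = pd.DataFrame({
    "Gender": ["M", "M", "F", "F"],
    "Age": [40, 50, 60, 70],
    "EstimatedYearlyIncome": [30000, 50000, 70000, 90000],
    "ChurnScore": [1.0, 0.5, 0.5, 0.0],
})

# Grouping: mean of a numeric variable per category.
income_by_gender = toy.groupby("Gender")["EstimatedYearlyIncome"].mean()

# Correlation matrix: pairwise Pearson correlation of the numeric columns.
corr = toy.corr(numeric_only=True)

print(income_by_gender)
print(corr.round(2))
```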

In [None]:
'''
All variables
'''
...

In [None]:
'''
Grouping
'''
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
'''
Bins
'''
# https://scikit-learn.org/stable/modules/preprocessing.html#discretization
...
print('Bin Edges')
...
print('Alcohol Groups')
print(...)
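The linked scikit-learn page covers `KBinsDiscretizer`. As a pandas-only alternative for equal-width bins, `pd.cut` can be sketched as follows; the choice of `Age`, the three bins, and the labels are assumptions for illustration and are not what the cell above computes.

```python
import pandas as pd

# Hypothetical ages, not taken from the real data.
ages = pd.Series([29, 40, 46, 56, 100], name="Age")

# Three equal-width bins; retbins=True also returns the bin edges.
age_groups, bin_edges = pd.cut(
    ages, bins=3, labels=["low", "mid", "high"], retbins=True
)

print("Bin Edges")
print(bin_edges)
print("Age Groups")
print(age_groups.value_counts(sort=False))
```

Equal-width bins can end up very unevenly filled when the variable is skewed; `pd.qcut` splits by quantiles instead.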

In [None]:
'''
Statistical Dispersion
'''
...
fig.suptitle('Histograms')

...
# histplot replaces the deprecated distplot
sns.histplot(df['Flavanoids'], ax=axs[0, 1], kde=True)
...
sns.histplot(df['Proline'], ax=axs[1, 1], kde=True)

A box plot is a good way to understand the relationship between a numeric variable and a categorical variable: it shows the distribution of the numeric variable within each category.
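The quantities a box plot draws can also be computed directly. The sketch below takes the quartiles of a numeric variable per category and flags values above the upper limit under the common 1.5 × IQR convention. The data is hypothetical, the pairing of `Target` with `Age` is an assumption, and only the upper side is checked to keep the example short.

```python
import pandas as pd

# Hypothetical values; only the column names come from the dataset above.
toy = pd.DataFrame({
    "Target": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [30, 32, 34, 36, 50, 52, 54, 100],
})

# Quartiles per category: the numbers a box plot draws as the box.
quartiles = toy.groupby("Target")["Age"].quantile([0.25, 0.5, 0.75]).unstack()

# Upper limit per category under the 1.5 * IQR convention.
iqr = quartiles[0.75] - quartiles[0.25]
upper = quartiles[0.75] + 1.5 * iqr

# Rows whose Age lies above the upper limit of their own category.
outliers = toy[toy["Age"] > toy["Target"].map(upper)]

print(quartiles)
print(outliers)
```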

In [None]:
'''
Box plots (Outliers)
'''
...
plt.title("Boxplot for Class vs Proline")
...

In [None]:
...
plt.title("Boxplot for Class vs Flavanoids")
...

In [None]:
_, ax = plt.subplots(figsize=(15, 6))
...


In [None]:
...
fig.suptitle('Boxplots for 4 variables')
sns.boxplot(y=df['Color intensity'], ax=axs[0, 0])
...
sns.boxplot(y=df['Alcohol'], ax=axs[1, 1])

In [None]:
'''
Scatter plots
'''
...
plt.show()

In [None]:
'''
Relations
'''
...

In [None]:
'''
Correlation
'''
...

**Critical analysis of the previous results**

* Which components characterize the various types of wine?
* Which component is the most significant?