# EDA (Exploratory Data Analysis)

## Main commands in Python packages:

### Analysis:



**1) Pandas:**
- Importing Pandas: import pandas as pd
- shape: Get the dimensions of the dataframe (rows, columns).
- head(): Displays the first few rows of the DataFrame.
- info(): Provides a concise summary of the DataFrame, including data types and missing values.
- describe(): Generates descriptive statistics of the DataFrame's numerical columns.
- value_counts(): Computes frequency counts of unique values in a column.
- groupby(): Groups data based on specified criteria.
- pd.pivot_table(): Creates a spreadsheet-style pivot table as a DataFrame.
- isnull(), notnull(): Checks for missing values in the DataFrame.
- drop_duplicates(): Remove duplicate rows.
- astype(): Convert data types of columns.
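
The Pandas commands listed above can be tried together on a small table. The following is a minimal sketch; the DataFrame and its column names (`city`, `salary`, `level`) are invented for illustration and are not part of any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Prague", "Brno", "Prague", "Brno", "Prague"],
    "salary": [100.0, 80.0, None, 80.0, 100.0],
    "level": ["senior", "junior", "senior", "junior", "senior"],
})

print(df.shape)                                  # (rows, columns)
print(df.head(3))                                # first three rows
df.info()                                        # dtypes and non-null counts
print(df.describe())                             # statistics of numerical columns
print(df["city"].value_counts())                 # frequency of each unique value
print(df.groupby("city")["salary"].mean())       # mean salary per city (NaN is skipped)
print(pd.pivot_table(df, index="city", columns="level",
                     values="salary", aggfunc="mean"))
print(df.isnull().sum())                         # missing values per column

# drop_duplicates() and astype() both return new DataFrames.
deduped = df.drop_duplicates()
deduped = deduped.astype({"city": "category"})
print(deduped.dtypes)
```

Note that `drop_duplicates()` and `astype()` leave the original `df` unchanged unless the result is assigned back.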

**2) NumPy:**
- Importing NumPy: import numpy as np
- np.mean(), np.median(), np.std(): Computes mean, median, and standard deviation of numerical data.
- np.min(), np.max(): Finds the minimum and maximum values in an array.
- np.percentile(), np.quantile(): Calculates percentiles and quantiles to understand data distribution.
- np.corrcoef(): Computes the correlation coefficient between two variables.
- np.histogram(): Generate histogram counts.
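
As a rough illustration of the NumPy functions above, the sketch below applies them to two short arrays. The arrays `x` and `y` are made up for the example.

```python
import numpy as np

# Hypothetical example arrays; y is an exact linear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

mean = np.mean(x)
median = np.median(x)
std = np.std(x)                       # population standard deviation (ddof=0) by default
lowest, highest = np.min(x), np.max(x)
q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
q2 = np.quantile(x, 0.5)              # same idea, with the level given as a fraction
corr = np.corrcoef(x, y)[0, 1]        # off-diagonal entry of the 2x2 correlation matrix
counts, edges = np.histogram(x, bins=2)

print(mean, median, std, lowest, highest, q1, q2, q3, corr)
print(counts, edges)
```

`np.std()` uses `ddof=0` unless told otherwise, whereas Pandas' `Series.std()` defaults to `ddof=1`, so the two can differ on the same data.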

**3) SciPy:**
- Importing SciPy's statistics module: from scipy import stats
- stats.describe(): Generates descriptive statistics for array-like input.
- stats.ttest_ind(): Performs a t-test to compare means of two independent samples.
- stats.pearsonr(), stats.spearmanr(): Calculates Pearson and Spearman correlation coefficients between two variables.
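
A small sketch of the SciPy functions above, assuming the statistics module is imported with `from scipy import stats`. The sample arrays are invented for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, invented for illustration.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 3.0, 4.0, 5.0, 6.0])

summary = stats.describe(a)               # nobs, minmax, mean, variance, skewness, kurtosis
t_stat, p_value = stats.ttest_ind(a, b)   # two-sided t-test assuming equal variances

# A monotonic but non-linear relationship: Spearman is 1, Pearson is below 1.
x = np.arange(10.0)
y = x ** 3
r, r_p = stats.pearsonr(x, y)
rho, rho_p = stats.spearmanr(x, y)

print(summary)
print(t_stat, p_value)
print(r, rho)
```

Comparing `r` and `rho` on the same data is a quick way to spot relationships that are monotonic but not linear.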

### Visualization:

**1) Matplotlib**:
- Importing Matplotlib: import matplotlib.pyplot as plt
- plt.plot(), plt.scatter(): Create line plots and scatter plots.
- plt.hist(): Plots histograms to show the distribution of a single variable.
- plt.boxplot(): Visualize the distribution of numerical data using box plots.
- plt.bar(), plt.barh(): Create bar plots.
- plt.xlabel(), plt.ylabel(), plt.title(): Set labels and titles for the plot.
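
The Matplotlib calls above are often combined into a single figure during EDA. The sketch below draws a histogram and a box plot of the same variable side by side. The random data is made up, and the `Agg` backend is selected only so that the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 500 draws from a standard normal distribution.
rng = np.random.default_rng(42)
values = rng.normal(size=500)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))

# Left panel: distribution of the variable as a histogram.
axes[0].hist(values, bins=30)
axes[0].set_title("Histogram")
axes[0].set_xlabel("Value")
axes[0].set_ylabel("Count")

# Right panel: the same variable as a box plot, which highlights outliers.
axes[1].boxplot(values)
axes[1].set_title("Box plot")

fig.tight_layout()
```

In an interactive session, `plt.show()` displays the figure.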

**2) Seaborn**:
- Importing Seaborn: import seaborn as sns
- sns.scatterplot(): Creates a scatter plot to visualize relationships between two variables.
- sns.pairplot(): Creates a matrix of scatter plots for pairwise relationships in the dataset.
- sns.heatmap(): Displays a heatmap to visualize the correlation matrix between variables.
- sns.boxplot(), sns.violinplot(): Draws box plots or violin plots to identify outliers and visualize distribution across different categories.
- sns.histplot(), sns.displot(): Plots histograms to show the distribution of a single variable (they replace sns.distplot(), which is deprecated since seaborn 0.11).

**3) Plotly** (for interactive visualizations):
- Importing Plotly: import plotly.express as px 
- px.scatter(): Creates interactive scatter plots.
- px.histogram(), px.density_heatmap(): Generates interactive histograms and density heatmaps.
- px.box(), px.violin(): Produces interactive box plots and violin plots.

## EDA packages in Python

In [3]:
# Importing main libraries:
import pandas as pd
import numpy as np
import datetime

# Statistics and EDA libraries:
from scipy import stats
import sweetviz as sv
from ydata_profiling import ProfileReport  # formerly pandas_profiling

# Plotting libraries:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set the Pandas display options to show all columns and rows (not truncated):
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# Creating a class for different print styles:
class style:
    #-------------------
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    RED = '\033[91m'
    #-------------------
    YELLOW = '\033[93m'
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    #-------------------
    END = '\033[0m'
    #-------------------

In [10]:
# Read the CSV file into a Pandas DataFrame
try:
    data = pd.read_csv('Project_01_Linkedin_jobs/dataset/job_postings.csv')
    print('File was successfully loaded to the dataframe!')
except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as err:
    print(f'File was not loaded into the dataframe: {err}')

File was successfully loaded to the dataframe!


### **Sweetviz**

[**Sweetviz**](https://pypi.org/project/sweetviz/) is an open-source Python library that generates beautiful, high-density visualizations to kickstart **EDA (Exploratory Data Analysis)** with just **two lines of code**. Output is a fully self-contained **HTML application**.

The system is built around quickly **visualizing target values** and **comparing datasets**. Its goal is to support quick analysis of target characteristics, training vs. testing data, and other such data characterization tasks.

In [None]:
import sweetviz as sv

my_report = sv.analyze(data)
my_report.show_html(filepath='EDA_sweetviz_report.html', open_browser=True, layout='widescreen', scale=None)

### **ydata profiling** (formerly pandas-profiling)

The **pandas-profiling** package was renamed to [**ydata-profiling**](https://pypi.org/project/ydata-profiling/).

[**ydata-profiling**](https://pypi.org/project/ydata-profiling/) generates a **detailed report for the dataset**, with many options for customizing its content. The report covers dataset **statistics**, value **distributions**, **missing values**, **memory usage**, and more, which makes it useful for exploring and analyzing data efficiently.

This makes it a good fit for **Exploratory Data Analysis (EDA)**, where the goal is to understand the underlying structure of the data, detect patterns, and generate insights in a visual format.

In [11]:
from ydata_profiling import ProfileReport

profile = ProfileReport(data, title="Pandas Profiling Report", explorative=True)
profile.to_widgets()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

(using `df.profile_report(correlations={"auto": {"calculate": False}})`
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: 'OffsiteApply'')
  annotation = ("{:" + self.fmt + "}").format(val)
(using `df.profile_report(missing_diagrams={"Heatmap": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: 'could not convert string to float: '--'')


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

AttributeError: 'NoneType' object has no attribute 'replace'

## Examples of code:

In [None]:
# Column data types:
data.dtypes

job_id                          int64
company_id                    float64
title                          object
description                    object
max_salary                    float64
med_salary                    float64
min_salary                    float64
pay_period                     object
formatted_work_type            object
location                       object
applies                       float64
original_listed_time          float64
remote_allowed                float64
views                         float64
job_posting_url                object
application_url                object
application_type               object
expiry                        float64
closed_time                   float64
formatted_experience_level     object
skills_desc                    object
listed_time                   float64
posting_domain                 object
sponsored                       int64
work_type                      object
currency                       object
compensation

In [None]:
# Showing the count of NULLs in each column (descending count order):
null_counts = data.isnull().sum().sort_values(ascending=False)
null_counts

skills_desc                   32909
closed_time                   32074
med_salary                    31005
remote_allowed                28444
max_salary                    22135
min_salary                    22135
compensation_type             19894
pay_period                    19894
currency                      19894
applies                       17008
posting_domain                13558
application_url               12250
formatted_experience_level     9181
views                          7360
company_id                      654
description                       1
sponsored                         0
work_type                         0
listed_time                       0
job_id                            0
job_posting_url                   0
expiry                            0
application_type                  0
original_listed_time              0
location                          0
formatted_work_type               0
title                             0
scraped                     

In [None]:
# Basic info including missings:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33246 entries, 0 to 33245
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   job_id                      33246 non-null  int64  
 1   company_id                  32592 non-null  float64
 2   title                       33246 non-null  object 
 3   description                 33245 non-null  object 
 4   max_salary                  11111 non-null  float64
 5   med_salary                  2241 non-null   float64
 6   min_salary                  11111 non-null  float64
 7   pay_period                  13352 non-null  object 
 8   formatted_work_type         33246 non-null  object 
 9   location                    33246 non-null  object 
 10  applies                     16238 non-null  float64
 11  original_listed_time        33246 non-null  float64
 12  remote_allowed              4802 non-null   float64
 13  views                       258

In [None]:
# Selecting columns by their data type:
data.select_dtypes('int64').head()

Unnamed: 0,job_id,sponsored,scraped
0,3757940104,0,1699138101
1,3757940025,0,1699085420
2,3757938019,0,1699085644
3,3757938018,0,1699087461
4,3757937095,0,1699085346


In [None]:
# Maximum length of the column name:
max(len(col) for col in data.columns)

26

### Binning data

In [None]:
# Binning of the maturity months:
#bins = [0, 13, 25, 37, 49, 61, 73, 85, np.inf]  # Bin edges
#labels = ['01: 0-12', '02: 13-24', '03: 25-36', '04: 37-48', '05: 49-60', '06: 61-72', '07: 73-84', '08: 85-96']  # Category names
bins = [0, 24, 36, 48, 60, 72, 84, np.inf]  # Bin edges
labels = ['01: 0-24', '02: 25-36', '03: 37-48', '04: 49-60', '05: 61-72', '06: 73-84', '07: 85-96']  # Category names
data['Maturity_Months_cat'] = pd.cut(data['Maturity_Months'], bins=bins, labels=labels)
data['Maturity_Months_cat'].value_counts()
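
One detail worth checking when binning with `pd.cut()`: by default the bins are closed on the right and open on the left, so with the edges above the first bin is `(0, 24]` and a value of exactly 0 is left uncategorized (NaN). The sketch below illustrates this on made-up values and shows `include_lowest=True` as one possible remedy. The series `months` and the label `'07: 85+'` for the open-ended last bin are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical maturities in months, chosen to sit on the bin edges.
months = pd.Series([0, 12, 24, 25, 36, 84, 85, 120], name="Maturity_Months")

bins = [0, 24, 36, 48, 60, 72, 84, np.inf]
labels = ["01: 0-24", "02: 25-36", "03: 37-48", "04: 49-60",
          "05: 61-72", "06: 73-84", "07: 85+"]

# Default behaviour: intervals are (0, 24], (24, 36], ... so 0 falls outside.
default_cut = pd.cut(months, bins=bins, labels=labels)

# include_lowest=True makes the first interval closed on the left as well.
inclusive_cut = pd.cut(months, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"months": months,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```

Whether 0 should belong to the first bin depends on the data, so it is worth comparing `value_counts(dropna=False)` before and after.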

### Plotting

In [None]:
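# Assumes `result`, `target_flag` and `new_pred` already exist (see the groupby in the next cell).
import mlflow

plt.figure(figsize=(7,5))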
plt.title('Probability of prepayment - Decision_score_new_cat', fontsize=12, y=1.02)
plt.xlabel('Score', fontsize=12, labelpad=10)
plt.xticks(fontsize=10)
plt.ylabel('Probability of prepayment', fontsize=12, labelpad=10)
plt.yticks(fontsize=10)
ax=plt.gca()

# Get the values from the results dataframe:
x_values = ['0-349',350,400,450,500,550,600,650,'700-inf']
y1_values = result[target_flag,'mean'].values
y2_values = result[new_pred,'mean'].values
plt.plot(x_values, y1_values, '-o', label='Data (Prepayments)')
plt.plot(x_values, y2_values, '-o', label='Prediction (Prepayments)')
plt.legend(fontsize=10)
#ax.set_xlim(xmin=0)
ax.set_ylim([0, 1.4*max(result[target_flag,'mean'])])  # Setting y-axis limits from 0 to 40% above the maximum data value

# Recording a plot to MLFlow experiment:
fig = plt.gcf()
mlflow.log_figure(fig, 'Probability of prepayment - Decision_score_new_cat.png')

# Showing the results:
plt.show()
print('\n')
result

In [None]:
# Probability of Default - Decision Score:
result = data.groupby('Decision_score_new_cat')[[target_flag, new_pred]].aggregate(['mean', 'size'])

# In percentage:
#result[target_flag, 'mean'] *= 100
#result[new_pred, 'mean'] *= 100

plt.figure(figsize=(7,5))
plt.title('Probability of default - Decision_score_new_cat', fontsize=12, y=1.02)
plt.xlabel('Score', fontsize=12, labelpad=10)
plt.xticks(fontsize=10)
plt.ylabel('Probability of default', fontsize=12, labelpad=10)
plt.yticks(fontsize=10)

# Get the values from the results dataframe:
x_values = ['0-349',350,400,450,500,550,600,650,'700-inf']
y1_values = result[target_flag,'mean'].values
y2_values = result[new_pred,'mean'].values
plt.plot(x_values, y1_values, '-o', label='Data (Defaults)')
plt.plot(x_values, y2_values, '-o', label='Prediction (Defaults)')
plt.legend(fontsize=10)

# Recording a plot to MLFlow experiment:
fig = plt.gcf()
mlflow.log_figure(fig, 'Probability of default - Decision_score_new_cat.png')

# Showing the results:
plt.show()
print('\n')
result

In [None]:
# Calculating results grouped by Rating:

# Group the dataframe by the rating column:
grouped_by_rating = data_ERM.groupby('Decision_Rating_New_MSE')

# Calculate the count of the observations for each rating
count_by_rating = data_ERM.groupby('Decision_Rating_New_MSE').size().reset_index(name='Count')

# Define a function to calculate the weighted average:
def volume_weighted_averages(df):
    weights = df['Loan_Amount']
    volume_weighted_avg = (df[['survival_prediction_prep', 'survival_prediction_bad', 'cum_survival', 'cum_default', 'cum_prep', 'ERM_FINAL']] * weights.values[:, np.newaxis]).sum() / weights.sum()
    return volume_weighted_avg

# Calculate the weighted average for each value of the rating column:
volume_weighted_avg = grouped_by_rating.apply(volume_weighted_averages).reset_index()

# Join the count and the weighted averages dataframes using the rating column:
result = volume_weighted_avg.merge(count_by_rating, on='Decision_Rating_New_MSE')

result['ERM_FINAL'] = result['ERM_FINAL']*12
results_grouped_by_rating = result

# Exporting the results to MLFlow artifact:
results_grouped_by_rating.to_csv('Results_grouped_by_rating.csv', float_format='%.9f', index=False)
mlflow.log_artifact('Results_grouped_by_rating.csv')

# Creating a plot:
plt.figure(figsize=(7,5))
plt.title('Estimation of ERM', fontsize=12, y=1.02)
plt.xlabel('Decision_Rating_New_MSE', fontsize=12, labelpad=10)
plt.xticks(fontsize=10)
plt.ylabel('Probability', fontsize=12, labelpad=10)
plt.yticks(fontsize=10)
ax=plt.gca()

# Get the values from the results dataframe:
x_values = results_grouped_by_rating['Decision_Rating_New_MSE']
#y1_values = results_grouped_by_rating['survival_prediction_prep'].values
#y2_values = results_grouped_by_rating['survival_prediction_bad'].values
y1_values = results_grouped_by_rating['cum_prep'].values
y2_values = results_grouped_by_rating['cum_default'].values
#y3_values = results_grouped_by_rating['cum_survival'].values
y4_values = results_grouped_by_rating['ERM_FINAL'].values

plt.plot(x_values, y1_values, '-o', label='Prepayments (cumulative)')
plt.plot(x_values, y2_values, '-o', label='Defaults (cumulative)')
#plt.plot(x_values, y3_values, '-o', label='Survival', linewidth=4.0)
plt.plot(x_values, y4_values, '-o', label='ERM')
plt.legend(fontsize=10)
#ax.set_xlim(xmin=0)
#ax.set_ylim([0, 1.4*max(result[target_flag,'mean'])])  # Setting y-axis limits from 0 to 40% above the maximum data value

# Recording a plot to MLFlow experiment:
fig = plt.gcf()
mlflow.log_figure(fig, 'Estimation of ERM - Rating.png')

# Showing the results:
plt.show()
print('\n')
results_grouped_by_rating
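
The volume-weighted average used above is sum(weight * value) / sum(weight) within each group. As a minimal sketch on invented numbers, the same quantity can also be computed without `groupby().apply()` by summing the weighted values and the weights per group and dividing. The DataFrame `loans` and its `rating` column are hypothetical; `Loan_Amount` and `cum_default` reuse the column names from the cell above.

```python
import pandas as pd

# Hypothetical loans, invented for illustration.
loans = pd.DataFrame({
    "rating": ["A", "A", "B", "B"],
    "Loan_Amount": [100.0, 300.0, 200.0, 200.0],
    "cum_default": [0.10, 0.20, 0.30, 0.50],
})

# Sum of weight * value and sum of weights for each rating.
totals = (
    loans.assign(weighted_value=loans["cum_default"] * loans["Loan_Amount"])
         .groupby("rating")[["weighted_value", "Loan_Amount"]]
         .sum()
)

# Volume-weighted average per rating.
weighted_avg = totals["weighted_value"] / totals["Loan_Amount"]
print(weighted_avg)
```

For rating A this gives (100 * 0.10 + 300 * 0.20) / 400 = 0.175, compared with an unweighted mean of 0.15.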

In [None]:
# Changing data type to float:
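for column in data.columns:
    try:
        data[column] = data[column].astype('float')
    except (ValueError, TypeError):
        # Raised when the column holds values that cannot be parsed as numbers.
        print(f'{column} could not be converted to float.')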
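
The loop above converts a column only if every value in it can be parsed. When a column is mostly numeric but contains a few placeholder strings, `pd.to_numeric()` with `errors='coerce'` is one alternative: it turns unparseable entries into NaN instead of failing. A minimal sketch, with a made-up DataFrame `raw`:

```python
import pandas as pd

# Hypothetical data: a numeric column polluted by a '--' placeholder.
raw = pd.DataFrame({
    "views": ["10", "25", "--"],
    "title": ["Analyst", "Engineer", "Manager"],
})

# Unparseable entries become NaN; the rest are converted to numbers.
views_numeric = pd.to_numeric(raw["views"], errors="coerce")

print(views_numeric)
print("Values that could not be parsed:", int(views_numeric.isna().sum()))
```

Counting the NaN values introduced by the conversion shows how much information the coercion discards.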

In [None]:
# Convert the array to a DataFrame:

array = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

df = pd.DataFrame(array, columns=['name','state','ssss'])

# Display the DataFrame
print(df)

   name  state  ssss
0     1      2     3
1     4      5     6
2     7      8     9
