joseortegalabra/exploratory-data-analysis-automatic

Automatic EDA

Write a JSON configuration file, "config.json", and run the Python script "main.py" to generate an automatic EDA for time series forecasting

Observations:

  • Most of the analyses can be applied to other kinds of data. Only the trend plots and ACF/PACF apply exclusively to time series
  • The folder data contains a Jupyter notebook that generates the 3 different datasets used in this repo. The pkl files with the data are not pushed
  • The folder output_eda stores the plots produced by the EDA. These outputs are not pushed, but they can be regenerated by running the code
  • The script main.py calls the functions that build subplots. In addition, there are functions that generate individual plots, which main.py does not call (for example, main.py calls a function that plots subplots of histograms, but there is also a function that plots the histogram of a single feature). EVERY PLOT has both an individual version and a subplot version
  • The repo with the Jupyter notebooks used to develop these scripts can be found at the following link: https://github.com/joseortegalabra/exploratory-data-analysis

Template code:

  • Many functions receive a pandas dataframe, plus whatever other parameters each function needs, and return a plotly figure
  • With the plotly figure, you can display it in a Jupyter notebook with the method fig.show(), or save it to a folder with fig.write_html to get an interactive HTML figure or fig.write_image to get a static image similar to those of other packages such as matplotlib or seaborn
  • The script main.py reads the config file, extracts from config.json the parameters to use, calls the functions that generate the plots to get plotly figures, and finally decides what to do with each figure (show, save html, save png, save pdf, etc.)

Run code:

  • It is very simple
  • Open a console, for example Anaconda Prompt
  • Activate the environment you are using: run conda env list to see the environments, then conda activate followed by the environment name
  • Navigate to the folder where this repo is located: cd .. then cd automatic-exploratory-data-analysis
  • Run the script main.py: python main.py

Explanation of config.json

How to fill in config.json. Important: this configuration covers only the plots that need parameters. The code generates many more plots, but those do not need parameters


Initial parameters

global parameters

"name_report": "indicate the name of the report"

"name_data_pkl": "indicate the name of the file that contains the data"

"target": "indicate the name of the target"

"list_features": "indicate the list of features"

"number_columns": "indicate the number of columns for the plots that accept multiple columns"

"reports_to_show": indicate with true/false which reports to generate and which to skip

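As a rough illustration of how the global parameters could look and be read, here is a minimal sketch. Only the key names come from the list above; the values, the nesting under "global_parameters" and the report names inside "reports_to_show" are assumptions and may differ from the repository's actual config.json.

```python
import json

# Hypothetical values; only the key names come from the README.
# The nesting and the report names are assumptions for illustration.
config_example = {
    "global_parameters": {
        "name_report": "eda_example",
        "name_data_pkl": "data_example.pkl",
        "target": "y",
        "list_features": ["x1", "x2", "x3"],
        "number_columns": 2,
    },
    "reports_to_show": {
        "univariate_analysis": True,
        "bivariate_analysis": True,
        "segmentation_analysis": False,
        "categorical_analysis": False,
    },
}

# Serialize as it would appear in a JSON file, then parse it back
config_text = json.dumps(config_example, indent=4)
config = json.loads(config_text)

params = config["global_parameters"]
# Keep only the reports flagged as true
reports_enabled = [name for name, run in config["reports_to_show"].items() if run]
print(params["target"], reports_enabled)
```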

Univariate analysis

"ydata_profiling"

"minimal": generate a minimal ydata-profiling report. Set it to true whenever the dataset is huge

"zoom_tendency:(start_date, end_date)": indicate the date range for the trend plots. When the dataset is huge, plotting all the data could be too much

"smooth_data": indicate the parameters for the different ways of smoothing the data, such as moving average, weighted moving average and exponential moving average

acf_pacf:(lags): indicate the maximum number of lags to plot in the autocorrelation function and partial autocorrelation function
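
To make the zoom and smoothing options concrete, the following sketch shows one common way to compute them with pandas. It is not taken from the repository: the series, the window size and the linear weights of the weighted moving average are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily series for illustration
index = pd.date_range("2023-01-01", periods=10, freq="D")
series = pd.Series(np.arange(10, dtype=float), index=index, name="y")

# Zoom: keep only the dates between a start and an end date (both inclusive)
zoomed = series.loc["2023-01-03":"2023-01-08"]

window = 3  # hypothetical window size

# Simple moving average
ma = series.rolling(window=window).mean()

# Weighted moving average: more recent points receive larger weights
weights = np.arange(1, window + 1, dtype=float)
wma = series.rolling(window=window).apply(
    lambda values: np.dot(values, weights) / weights.sum(), raw=True
)

# Exponential moving average
ema = series.ewm(span=window, adjust=False).mean()
```

The first window - 1 values of the moving averages are NaN because the window is not yet full, which is worth remembering when overlaying the smoothed trend on the original data.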


Bivariate analysis

correlations:(threshold): indicate whether to apply a threshold when showing the correlations. For example, show only the correlations with a value over 0.1

scatter_plot:(marginal): plot a scatter plot with marginal histograms of each feature in the scatter plot

correlations_features_lagged_target:(lags): indicate the number of lags applied to the features when analyzing the correlations of the lagged features against the target

parallel:(list_features): indicate the list of features to plot in a parallel plot, with the target as the final step
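
The threshold and lag parameters can be illustrated with a short pandas sketch. The data is synthetic, and applying the threshold to the absolute value of the correlation is an assumption; the repository may filter differently.

```python
import numpy as np
import pandas as pd

# Made-up data: the target follows x1 with a delay of 2 steps
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = df["x1"].shift(2).fillna(0.0)

# Correlation matrix, hiding values whose magnitude is below the threshold
threshold = 0.1
corr = df.corr()
corr_filtered = corr.where(corr.abs() >= threshold)

# Correlation of each lagged feature against the target, for lags 0..lags
lags = 3
features = ["x1", "x2"]
corr_lagged = pd.DataFrame(
    {
        feature: [df[feature].shift(lag).corr(df["y"]) for lag in range(lags + 1)]
        for feature in features
    },
    index=pd.Index(range(lags + 1), name="lag"),
)
print(corr_lagged.round(2))
```

In this synthetic example the correlation of x1 with the target peaks at lag 2, which is the delay built into the data.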


Segmentation analysis

segmentation_analysis

"type": indicate the type of segmentation: custom or by percentile

"var_segment": indicate the feature or target used to segment the data

"interval_segment": for custom segmentation, indicate the value intervals that define the different segments

"labels_segment": for custom segmentation, indicate the names of the different segments

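The two segmentation types can be sketched with pandas as shown here. This is an illustration under assumptions, not the repository's implementation: the interval edges, labels and data are made up, and the variable names simply mirror the config keys above.

```python
import pandas as pd

# Made-up values of the variable used to segment the data
df = pd.DataFrame({"y": [1, 4, 6, 9, 12, 15, 18, 22]})

# Custom segmentation: explicit interval edges and one label per interval
interval_segment = [0, 5, 10, 25]           # hypothetical edges
labels_segment = ["low", "medium", "high"]  # hypothetical labels
df["segment_custom"] = pd.cut(
    df["y"], bins=interval_segment, labels=labels_segment, include_lowest=True
)

# Percentile segmentation: groups of (roughly) equal size, here quartiles
df["segment_percentile"] = pd.qcut(df["y"], q=4, labels=["q1", "q2", "q3", "q4"])

print(df)
```

With custom intervals the segments can have very different sizes, while percentile segmentation balances the number of rows per segment.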

Categorical analysis

categorical_analysis

"features": list of features that will be transformed into categorical variables. It is necessary to transform all the features of the dataframe

"percentile_features": list with the kind of percentile used to categorize each feature. Choices: quartile, quintile, decile

"target": list with the target that will be transformed into a categorical variable

"percentile_target": list with the kind of percentile used to categorize the target. Choices: quartile, quintile, decile

"crosstab_freq_pair_features (freq_normalized)": normalize the frequency table between each pair of features

"crosstab_freq_target_feature (freq_normalized)": normalize the frequency table between each feature and the target

"heatmap_multiple_features_vs_target_continuous (aggregation_target)": aggregation applied to the target in a comparative table between the categories of each pair of features, for example mean, std, min, max, etc.

parallel:(list_features): indicate the list of features to plot in a parallel plot, with the target as the final step
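
The categorical options can be illustrated with the following pandas sketch. It is a hedged example rather than the repository's code: the mapping PERCENTILE_BINS, the data, and the choice of normalize="all" for the normalized frequency table are assumptions.

```python
import pandas as pd

# Map the percentile names used in the config to a number of bins
PERCENTILE_BINS = {"quartile": 4, "quintile": 5, "decile": 10}

# Made-up continuous feature and target
df = pd.DataFrame({"x1": range(1, 21), "y": [v * 2 for v in range(1, 21)]})

# Discretize the feature and the target into percentile-based categories
df["x1_cat"] = pd.qcut(df["x1"], q=PERCENTILE_BINS["quartile"], labels=False)
df["y_cat"] = pd.qcut(df["y"], q=PERCENTILE_BINS["quartile"], labels=False)

# Frequency table of feature vs target; normalize="all" turns counts into shares
freq = pd.crosstab(df["x1_cat"], df["y_cat"])
freq_normalized = pd.crosstab(df["x1_cat"], df["y_cat"], normalize="all")

# Aggregation of the continuous target per category (here, the mean)
target_mean = df.groupby("x1_cat")["y"].mean()
print(freq_normalized)
```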


Kinds of plots present in this repo - notebooks with the code used to develop these plots

ydata-profiling

1 link-data-profiling

Univariate Analysis

1 descriptive statistics table

2 histograms

3 kernel density + histograms

4 original data trend

5 boxplots with monthly aggregation

6 original vs smoothed data trend

7 Autocorrelation and partial autocorrelation functions-v1

7 Autocorrelation and partial autocorrelation functions-v2

8 Other analyses for time series-v1

8 Other analyses for time series-v2

Bivariate Analysis

1 correlations

2 scatter plots

3 correlations features with lag

4 Multivariate parallel plot

Segmentation Analysis

0 generate data segmentation

1 segmented data distribution

2 descriptive statistical table segmented data

3 histograms and boxplots segmented data

4 trend segmented data

5 segmented data correlations

6 scatter plots segmented data

7 parallel plot only when target is segmented

Categorical Analysis

0 generate categorical data

2 crosstab frequency features and target

3 frequency between 2 categorical features

4 univariate analysis categorical features vs continuous target

5 univariate analysis categorical features vs categorical target

6 bivariate analysis categorical features vs continuous target

7 bivariate analysis categorical features vs categorical target

8 parallel plot features categorical vs categorical target

9 woe iv - categorical features vs binary target

About

Scripts to run an automatic exploratory data analysis
