## <font color='darkblue'>Preface</font>
([article source](https://towardsdatascience.com/speeding-up-exploratory-data-analysis-with-python-838fe5e25b43)) <font size='3ptx'><b>Increasing Productivity with Open-Source EDA Libraries</b></font>

<b><font color='darkblue'>Exploratory Data Analysis</font></b> (EDA), especially when done for take-home job interviews, can be a very time-consuming process. <b>EDA’s main purpose is to investigate and learn about a new dataset. We want to gain insight into and familiarity with the data, ideally as quickly as possible.</b>

Take-home assignments often come with an instruction to take no more than 4 or 6 hours to complete. <font color='darkred'><b>On the job, the boss may be waiting impatiently for our analysis</b></font>.

Here we will explore 3 open-source libraries that can help speed up the EDA process:
* <font size='3ptx'><b><a href='#sect1'>DataPrep</a></b></font>
* Pandas Profiling
* SweetViz
<br/><br/>

In <b><a href='https://towardsdatascience.com/tackling-the-take-home-challenge-7c2148fb999e'>“Tackling the Take-Home Challenge”</a></b> we worked through a sample take-home EDA challenge using Python, Pandas, and Seaborn. The notebook was time-consuming to create and required a lot of code. <b>Let’s see if we can speed up our analysis with the help of the above libraries</b>.

### <font color='darkgreen'>Dataset</font>
We’ll continue using the widget factory datasets we created in [this article](https://towardsdatascience.com/generating-fake-data-with-python-c7a32c631b2a), where we worked through generating fake data with Python.

In [1]:
!pip install faker



In [2]:
import pandas as pd
import numpy as np
import faker

# create some fake data
fake = faker.Faker()

# function to build a list of dicts with fake values for our workers
def make_workers(num):
    """Return `num` fake worker records as a list of dicts."""

    # lists to randomly assign to workers
    status_list = ['Full Time', 'Part Time', 'Per Diem']
    team_list = [fake.color_name() for _ in range(4)]  # 4 random team names

    fake_workers = [
        {
            'Worker ID': x + 1000,
            'Worker Name': fake.name(),
            'Hire Date': fake.date_between(start_date='-30y', end_date='today'),
            # assign statuses from the list with different probabilities
            'Worker Status': np.random.choice(status_list, p=[0.50, 0.30, 0.20]),
            'Team': np.random.choice(team_list)
        } for x in range(num)]

    return fake_workers

worker_df = pd.DataFrame(make_workers(num=5000))
worker_df.head()

|   | Worker ID | Worker Name | Hire Date | Worker Status | Team |
|---|-----------|-------------|-----------|---------------|------|
| 0 | 1000 | Kaitlyn Anderson | 1993-02-27 | Full Time | Gainsboro |
| 1 | 1001 | Melanie Hensley | 2013-10-17 | Per Diem | Red |
| 2 | 1002 | Nathan Fox | 2007-05-02 | Full Time | SteelBlue |
| 3 | 1003 | Amanda Mitchell | 2007-05-04 | Full Time | SteelBlue |
| 4 | 1004 | Bonnie Melendez | 2000-03-13 | Per Diem | Gainsboro |
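Before moving on, it can be worth confirming that the weighted `np.random.choice` call produces roughly the status mix we asked for. The sketch below is only an illustration: `check_df`, the seeded `rng`, and the `comparison` table are hypothetical names that are not part of the original notebook, and the stand-in frame is built with NumPy alone so that it runs without `faker`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for worker_df: same status list and probabilities
# as make_workers(), but seeded and built without faker.
rng = np.random.default_rng(seed=42)
status_list = ['Full Time', 'Part Time', 'Per Diem']
target = [0.50, 0.30, 0.20]

check_df = pd.DataFrame({
    'Worker ID': np.arange(1000, 6000),
    'Worker Status': rng.choice(status_list, size=5000, p=target),
})

# Observed share of each status, as a fraction of all rows
observed = check_df['Worker Status'].value_counts(normalize=True)

# Put target and observed shares side by side (aligned on status name)
comparison = pd.DataFrame({
    'target': pd.Series(target, index=status_list),
    'observed': observed,
})
comparison['abs_diff'] = (comparison['observed'] - comparison['target']).abs()

print(comparison.round(3))
```

With 5,000 rows the observed shares should sit close to the targets; the same `value_counts(normalize=True)` call can be pointed at `worker_df['Worker Status']` directly.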


<a id='sect1'></a>
## <font color='darkblue'>DataPrep</font>
<font size='3ptx'><b><a href='https://dataprep.ai/'>DataPrep</a> bills itself as “the fastest and the easiest EDA tool in Python. It allows data scientists to understand a Pandas/Dask DataFrame with a few lines of code in seconds.”</b></font>

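It is indeed very easy to use and can reduce the EDA process to just a few lines of code. DataPrep is installed with pip. Note that in the run captured below, the install failed on Windows because building the `bottleneck` dependency requires Microsoft Visual C++ 14.0 or greater: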

In [3]:
!pip install dataprep

Collecting dataprep
  Using cached dataprep-0.4.1-py3-none-any.whl (3.5 MB)
Collecting flask_cors<4.0.0,>=3.0.10
  Using cached Flask_Cors-3.0.10-py2.py3-none-any.whl (14 kB)
Collecting levenshtein<0.13.0,>=0.12.0
  Using cached levenshtein-0.12.0-cp38-cp38-win_amd64.whl (82 kB)
Collecting jsonpath-ng<2.0,>=1.5
  Using cached jsonpath_ng-1.5.3-py3-none-any.whl (29 kB)
Collecting metaphone<0.7,>=0.6
  Using cached Metaphone-0.6-py3-none-any.whl
Collecting usaddress<0.6.0,>=0.5.10
  Using cached usaddress-0.5.10-py2.py3-none-any.whl (63 kB)
Collecting jinja2<3.0,>=2.11
  Using cached Jinja2-2.11.3-py2.py3-none-any.whl (125 kB)
Collecting varname<0.9.0,>=0.8.1
  Using cached varname-0.8.1-py3-none-any.whl (20 kB)
Collecting regex<2021.0.0,>=2020.10.15
  Using cached regex-2020.11.13-cp38-cp38-win_amd64.whl (270 kB)
Collecting dask[array,dataframe,delayed]<3.0,>=2.25
  Using cached dask-2.30.0-py3-none-any.whl (848 kB)
Collecting aiohttp<4.0,>=3.6
  Using cached aiohttp-3.8.1-cp38-cp38-win

  error: subprocess-exited-with-error
  
  Building wheel for bottleneck (pyproject.toml) did not run successfully.
  exit code: 1
  
  [19 lines of output]
  running bdist_wheel
  running build
  running build_py
  UPDATING build\lib.win-amd64-3.8\bottleneck/_version.py
  set build\lib.win-amd64-3.8\bottleneck/_version.py to '1.3.2'
  running build_ext
  running config
  compiling '_configtest.c':
  
  
  
  int __attribute__((optimize("O3"))) have_attribute_optimize_opt_3(void*);
  
  int main(void)
  {
      return 0;
  }
  
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for bottleneck
ERROR: Could not build wheels for bottleneck, which is required to install pyproject.toml-based projects


We can use the <font color='blue'>plot()</font> function to visualize our entire dataframe and to show key insights:

In [4]:
from dataprep.eda import plot

# using dataprep's plot function to get insights on each variable
plot(worker_df)

ModuleNotFoundError: No module named 'dataprep'
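The `ModuleNotFoundError` above follows from the failed install. As a rough sketch of how a notebook could guard against this, the snippet below checks whether `dataprep` is importable and, if not, falls back to a plain-pandas per-column summary. The helper `quick_overview` and the small `demo_df` are hypothetical names introduced only for illustration; they are not part of DataPrep and are far less detailed than its report.

```python
import importlib.util

import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per column with basic EDA facts."""
    rows = []
    for col in df.columns:
        s = df[col]
        mode = s.mode(dropna=True)  # most frequent value(s), ignoring missing
        rows.append({
            'column': col,
            'dtype': str(s.dtype),
            'missing': int(s.isna().sum()),
            'unique': int(s.nunique(dropna=True)),
            'most_common': mode.iloc[0] if not mode.empty else None,
        })
    return pd.DataFrame(rows).set_index('column')


# Check for the package without importing it (returns None if absent)
has_dataprep = importlib.util.find_spec('dataprep') is not None

# Tiny illustrative frame; in the notebook this would be worker_df
demo_df = pd.DataFrame({
    'Worker ID': [1000, 1001, 1002, 1003],
    'Worker Status': ['Full Time', 'Per Diem', 'Full Time', None],
    'Team': ['Red', 'Red', 'SteelBlue', 'Red'],
})

if has_dataprep:
    print('dataprep is available; plot(worker_df) can be used')
else:
    print('dataprep not found; using the pandas fallback summary')

overview = quick_overview(demo_df)
print(overview)
```

The fallback only reports dtype, missing count, unique count, and the most common value per column, so it is a stopgap until the install issue is resolved rather than a replacement for the library's output.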