# Exploratory Data Analysis (EDA) and essay

## ***Introduction***

*This assignment focuses on one of the most important aspects of data science, Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps & data imbalances, improve data quality, create better features and gain a deep understanding of your data before doing model training - and that ultimately helps train better models. In machine learning, there is a saying - "better data beats better algorithms" - meaning that it is more productive to spend time improving data quality than improving the code to train the model.*

*This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis.*

## ***Dataset***

*A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out the most effective tactics that will help the bank persuade more customers to subscribe to its term deposit in the next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing*

## ***Assignment***

1. ***Exploratory Data Analysis***

    *Review the structure and content of the data and answer questions such as:*
    - *Are the features (columns) of your data correlated?*
    - *What is the overall distribution of each variable?*
    - *Are there any outliers present?*
    - *What are the relationships between different variables?*
    - *How are categorical variables distributed?*
    - *Do any patterns or trends emerge in the data?*
    - *What is the central tendency and spread of each variable?*
    - *Are there any missing values and how significant are they?* 

2. ***Algorithm Selection***

    *Now that you have completed the EDA, which algorithms would suit the business purpose of the dataset? Answer questions such as:*
    - *Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).*
    - *What are the pros and cons of each algorithm you selected?*
    - *Which algorithm would you recommend, and why?*
    - *Are there labels in your data? Did that impact your choice of algorithm?*
    - *How does your choice of algorithm relate to the dataset?*
    - *Would your choice of algorithm change if there were fewer than 1,000 data records, and why?*

3. ***Pre-processing***

    *Now that you have done an EDA and selected an algorithm, what pre-processing (if any) would you require for:*
    - *Data Cleaning - improve data quality, address missing data, etc.*
    - *Dimensionality Reduction - remove correlated/redundant data that will slow down training*
    - *Feature Engineering - use of business knowledge to create new features*
    - *Sampling Data - using sampling to resize datasets*
    - *Data Transformation - regularization, normalization, handling categorical variables*
    - *Imbalanced Data - reducing the imbalance between classes*

## Step 1: Exploratory Data Analysis

First, we load the dataset with the `ucimlrepo` package, using the code provided on the dataset's website:

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
features = bank_marketing.data.features 
targets = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata) 
  
# variable information 
print(bank_marketing.variables) 


{'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the s
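
With the data loaded, a first pass over the assignment's EDA questions (missing values, class balance, central tendency and spread, correlation, outliers) might look like the sketch below. To keep it runnable without a download, it uses a tiny made-up stand-in for `features` and `targets`; the column names are an assumed subset of the real ones and the values are invented for illustration. The helpers `missing_summary` and `iqr_outliers` are hypothetical names, and the 1.5 x IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Made-up stand-in for the real `features` / `targets` frames returned by
# fetch_ucirepo(id=222). Column names are an assumed subset; values are invented.
features = pd.DataFrame({
    "age": [30, 45, 52, 38, 95, 41],
    "balance": [1200, -50, 300, 15000, 800, 450],
    "job": ["admin.", None, "technician", "services", "retired", None],
    "contact": [None, "cellular", None, "telephone", "cellular", None],
})
targets = pd.DataFrame({"y": ["no", "no", "yes", "no", "no", "no"]})


def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


def iqr_outliers(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Structure: number of rows/columns and the type of each column.
print(features.shape)
print(features.dtypes)

# Missing values and how significant they are.
print(missing_summary(features))

# Class balance of the target.
print(targets["y"].value_counts(normalize=True))

# Central tendency, spread, and pairwise correlation of numeric columns.
numeric = features.select_dtypes(include="number")
print(numeric.describe())
print(numeric.corr())

# Number of IQR-rule outliers per numeric column.
print({col: int(iqr_outliers(numeric[col]).sum()) for col in numeric.columns})
```

On the real data, the same calls can be run directly on the `features` and `targets` objects loaded above, skipping the stand-in frames.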

## Step 2: Algorithm Selection

## Step 3: Pre-processing
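
As a starting point for the pre-processing categories named in the assignment (data cleaning, data transformation, imbalanced data), here is a minimal pandas-only sketch. It is one possible approach, not a recommendation derived from the EDA: filling missing categories with an `"unknown"` label, min-max scaling, one-hot encoding, and random oversampling are all assumptions chosen for illustration. The helper names `preprocess` and `oversample_minority` are hypothetical, and the small frame is made-up stand-in data.

```python
import pandas as pd

# Made-up stand-in rows; the real frames come from fetch_ucirepo(id=222).
features = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "balance": [1200, -50, 300, 15000],
    "job": ["admin.", None, "technician", "admin."],
})
targets = pd.Series(["no", "no", "yes", "no"], name="y")


def preprocess(df):
    """Fill missing categories, min-max scale numeric columns, one-hot encode."""
    out = df.copy()
    cat_cols = out.select_dtypes(exclude="number").columns
    num_cols = out.select_dtypes(include="number").columns

    # Data cleaning: treat a missing category as its own level (an assumption).
    out[cat_cols] = out[cat_cols].fillna("unknown")

    # Data transformation: rescale each numeric column to the range [0, 1].
    col_min = out[num_cols].min()
    col_range = (out[num_cols].max() - col_min).replace(0, 1)  # avoid divide-by-zero
    out[num_cols] = (out[num_cols] - col_min) / col_range

    # Handling categorical variables: one 0/1 indicator column per category.
    return pd.get_dummies(out, columns=list(cat_cols), dtype=int)


def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size.

    Assumes a binary target and that X and y share the same index.
    """
    counts = y.value_counts()
    minority = counts.idxmin()
    n_extra = int(counts.max() - counts.min())
    extra_idx = y[y == minority].sample(
        n=n_extra, replace=True, random_state=seed
    ).index
    X_bal = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
    y_bal = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
    return X_bal, y_bal


X = preprocess(features)
X_bal, y_bal = oversample_minority(X, targets)
print(X.columns.tolist())
print(y_bal.value_counts())
```

If a model is trained later, oversampling would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.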