<h1 align="center">DATA PREPARATION</h1>
<h2 align="left"><ins>Contents</ins></h2>

- [**WHAT IS DATA PREPARATION?**](#intro)
    - [**Machine Learning Process**](#mlp)
        - [**Define Problem**](#define)
        - [**Prepare Data**](#prepare)
        - [**Evaluate Models**](#evaluate)
        - [**Finalize Model**](#finalize)
    - [**How to Choose Data Preparation Techniques**](#how)
    - [**Machine Learning Algorithms Expect Numbers**](#expect)
    - [**Machine Learning Algorithms Have Requirements**](#require)
    - [**Model Performance Depends on Data**](#depend)
- [**DATA PREPARATION TECHNIQUES**](#tech)
    - [**Common Tasks of Data Preparation**](#tasks)
        - [**Data Cleaning**](#clean)
        - [**Feature Selection**](#feat)
        - [**Data Transforms**](#trans)
        - [**Feature Engineering**](#eng)
        - [**Dimensionality Reduction**](#dim)
- [**LOADING DATA**](#load)
    - [**Scikit-Learn Sample Datasets**](#sample)
    - [**Creating a Simulated Dataset**](#create)
    - [**Loading Files**](#files)
    - [**Querying a SQL Database**](#sql)
- [**REFERENCES**](#ref)

In [1]:
import numpy as np
import pandas as pd
import re

In [2]:
# removes scientific notation from floating point numbers
np.set_printoptions(suppress=True)

<a id="intro"></a>
<h1 align="center">WHAT IS DATA PREPARATION?</h1>

The vast majority of the machine learning algorithms being used have been around for some time now. The implementation and application of these algorithms are well understood. What differs between projects is the data. As such, the preparation of the data is a primary task of any modern machine learning project. Data preparation is often the most time-consuming part of a machine learning project, and it may also be the most important.

On a predictive modeling project, such as classification or regression, raw data typically cannot be used directly. This is because of reasons such as:
- Machine learning algorithms require data to be numbers.
- Some machine learning algorithms impose requirements on the data.
- Statistical noise and errors in the data may need to be corrected.
- Complex nonlinear relationships may be teased out of the data.

As such, the raw data must be pre-processed prior to being used to fit and evaluate a machine learning model. This step in a predictive modeling project is referred to as **data preparation**, although it goes by many other names, such as *data wrangling, data cleaning, data pre-processing and feature engineering*. Some of these names may better fit as sub-tasks for the broader data preparation process. 
>**We can define data preparation as the act of transforming raw data into a form that is appropriate for modeling.**

Data preparation is not performed blindly. Machine learning algorithms require numerical input data, and most algorithm implementations maintain this expectation. As such, if the data contains data types and values that are not numbers, such as labels, then some form of transformation is required to change the data into numbers. In some cases it is less clear, for example: scaling a variable may or may not be useful to an algorithm. *Specific machine learning algorithms have expectations regarding the data types, scale, probability distribution, and relationships between input variables, and we may need to change the data to meet these expectations.*
>**The broader philosophy of data preparation is to discover how to best expose the unknown underlying structure of the problem to the learning algorithms.** This often requires an iterative path of experimentation through a suite of different data preparation techniques in order to discover what works well or best. 

This is the guiding light. *We don’t know the underlying structure of the problem; if we did, we wouldn’t need a learning algorithm to discover it and learn how to make skillful predictions.* Therefore, exposing the unknown underlying structure of the problem is a process of discovery, along with discovering the well- or best-performing learning algorithms for the project.

It can be more complicated than it appears at first glance. For example, different input variables may require different data preparation methods. Further, different variables or subsets of input variables may require different sequences of data preparation methods. It can feel overwhelming, given the large number of methods, each of which may have its own configuration and requirements. Nevertheless, the machine learning process steps before and after data preparation can help to inform what techniques to consider.

<a id="mlp"></a>
<h3><ins>Machine Learning Process</ins></h3>

The challenge of data preparation is that each dataset is unique. Datasets differ in the number of variables (tens, hundreds, thousands, or more), the types of the variables (numeric, nominal, ordinal, boolean), the scale of the variables, the drift in the values over time, and more.

Nevertheless, there are enough commonalities across predictive modeling projects that can define a loose sequence of steps and subtasks that are likely to be performed. Even though each project is unique, the steps on the path to a good or even the best result are generally the same from project to project. This is sometimes referred to as the **applied machine learning process**. The steps are the same, but the names of the steps and tasks performed may differ from description to description.
>No one can tell you what the best results are or might be, or what algorithms to use to achieve them. You must establish a **baseline** in performance as a point of reference to compare all of your models and discover what algorithm works best for your specific dataset.

<a id="define"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Step 1:</b> Define the Problem</p>

This step is concerned with learning enough about the project to select the framing or framings of the prediction task. For example, is it classification or regression, or some other higher-order problem type? It involves collecting the data that is believed to be useful in making a prediction and clearly defining the form that the prediction will take. It may also involve talking to project stakeholders and other people with deep expertise in the domain. This step also involves taking a close look at the data, as well as perhaps exploring the data using summary statistics and data visualization. 

<a id="prepare"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Step 2:</b> Prepare the Data</p>

This step is concerned with transforming the raw data that was collected into a form that can be used in modeling. For example, cleaning the data, transforming the data to numerical, preprocessing, etc.

<a id="evaluate"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Step 3:</b> Evaluate Models</p>

This step is concerned with evaluating machine learning models. This involves tasks such as selecting a performance metric for evaluating the skill of a model, establishing a baseline or floor in performance to which all model evaluations can be compared, and choosing a resampling technique for splitting the data into training and test sets to simulate how the final model will be used.<br>
$\;\;\;\;\;\;$For quick and dirty estimates of model performance, or for a very large dataset, a single train-test split of the data may be performed. It is more common to use k-fold cross-validation as the data resampling technique, often with repeats of the process to improve the robustness of the result. This step also involves tasks for getting the most out of well performing models such as hyperparameter tuning and ensembles of models.
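
As a rough illustration of the evaluation step, the sketch below compares a naive baseline against a candidate model using repeated stratified k-fold cross-validation. The synthetic dataset, the choice of `LogisticRegression`, and the accuracy metric are assumptions made for the example, not recommendations from the text.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1, random_state=1)

# 10-fold cross-validation, repeated 3 times to stabilize the estimate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Baseline: always predict the most frequent class
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"),
                                  X, y, scoring="accuracy", cv=cv)

# Candidate model, evaluated with exactly the same procedure
model_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               X, y, scoring="accuracy", cv=cv)

print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"model accuracy:    {model_scores.mean():.3f}")
```

Using the same `cv` object for both estimators keeps the comparison fair, since every model is scored on identical splits.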

<a id="finalize"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Step 4:</b> Finalize the Model</p>

This step is concerned with selecting and using a final model. Once a suite of models has been evaluated, you must choose a model that represents the solution to the project. This is called model selection and may involve further evaluation of candidate models on a hold out validation dataset, or selection via other project-specific criteria such as model complexity. It may also involve summarizing the performance of the model in a standard way for project stakeholders, which is an important step. Finally, there will likely be tasks related to the productization of the model, such as integrating it into a software project or production system and designing a monitoring and maintenance schedule for the model.

<a id="how"></a>
<h3><ins>How to Choose Data Preparation Techniques</ins></h3>

The step before data preparation involves defining the problem. Defining the problem may involve many sub-tasks, such as:
- Gather data from the problem domain.
- Discuss the project with subject matter experts.
- Select those variables to be used as inputs and outputs for a predictive model.
- Review the data that has been collected.
- Summarize the collected data using statistical methods.
- Visualize the collected data using plots and charts.

Information known about the data can be used in selecting and configuring data preparation methods. For example, plots of the data may help identify whether a variable has outlier values. This can help in data cleaning operations. It may also provide insight into the probability distribution that underlies the data. This may help in determining whether data transforms that change a variable’s probability distribution would be appropriate. Statistical methods, such as descriptive statistics, can be used to determine whether scaling operations might be required. Statistical hypothesis tests can be used to determine whether a variable matches a given probability distribution.
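
The checks described above can be sketched with pandas and SciPy. The two columns below are randomly generated stand-ins for real variables, and the interquartile-range rule and the Shapiro-Wilk test are only one possible choice of outlier rule and hypothesis test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: one roughly Gaussian column, one strongly skewed column
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "gaussian": rng.normal(loc=50, scale=5, size=500),
    "skewed": rng.exponential(scale=2.0, size=500),
})

# Descriptive statistics reveal differences in scale between variables
summary = df.describe()
print(summary.loc[["mean", "std", "min", "max"]])

# Interquartile-range rule to flag candidate outliers in one variable
q1, q3 = df["skewed"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["skewed"] < q1 - 1.5 * iqr) | (df["skewed"] > q3 + 1.5 * iqr)
print("candidate outliers:", int(outlier_mask.sum()))

# Shapiro-Wilk test: a small p-value is evidence against a Gaussian distribution
p_values = {col: stats.shapiro(df[col].to_numpy())[1] for col in df.columns}
print(p_values)
```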

Pairwise plots and statistics can be used to determine whether variables are related, and if so, how much, providing insight into whether one or more variables are redundant or irrelevant to the target variable. As such, there may be a lot of interplay between the definition of the problem and the preparation of the data. There may also be interplay between the data preparation step and the evaluation of models. Model evaluation may involve sub-tasks such as:
- Select a performance metric for evaluating model predictive skill.
- Select a model evaluation procedure.
- Select algorithms to evaluate.
- Tune algorithm hyperparameters.
- Combine predictive models into ensembles.

Information known about the choice of algorithms and the discovery of well performing algorithms can also inform the selection and configuration of data preparation methods. For example, the choice of algorithms may impose requirements and expectations on the type and form of input variables in the data. This might require variables to have a specific probability distribution, the removal of correlated input variables, and/or the removal of variables that are not strongly related to the target variable. 

The choice of performance metric may also require careful preparation of the target variable in order to meet the expectations, such as scoring regression models based on prediction error using a specific unit of measure, requiring the inversion of any scaling transforms applied to that variable for modeling. These examples, and more, highlight that although data preparation is an important step in a predictive modeling project, it does not stand alone. Instead, it is strongly influenced by the tasks performed both before and after data preparation. This highlights the highly iterative nature of any predictive modeling project.
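
One way to handle the inversion of a target scaling, assuming scikit-learn is used, is `TransformedTargetRegressor`, which scales the target for fitting and inverts the scaling on prediction. The data, the `MinMaxScaler` transform, and the linear model below are illustrative assumptions.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=1)

# The target is scaled to [0, 1] for fitting, then predictions are
# automatically mapped back to the original unit of measure
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   transformer=MinMaxScaler())
model.fit(X, y)
predictions = model.predict(X)

# The error is reported in the original units of the target
mae = mean_absolute_error(y, predictions)
print(f"MAE in original units: {mae:.2f}")
```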

<a id="expect"></a>
<h3><ins>Machine Learning Algorithms Expect Numbers</ins></h3>

Even though your data is represented in one large table of rows and columns, the variables in the table may have different data types. Some variables may be numeric, such as integers, floating-point values, ranks, rates, percentages, and so on. Other variables may be names, categories, or labels represented with characters or words, and some may be binary, represented with 0 and 1 or True and False. The problem is, machine learning algorithms at their core operate on numeric data. They take numbers as input and predict a number as output. All data is seen as vectors and matrices, using the terminology from linear algebra.

<a id="require"></a>
<h3><ins>Machine Learning Algorithms Have Requirements</ins></h3>

Even if the raw data contains only numbers, some data preparation is likely required. There are many different machine learning algorithms to choose from for a given predictive modeling project. We cannot know which algorithm will be appropriate, let alone the most appropriate for the task. Therefore, it's good practice to evaluate a suite of different candidate algorithms systematically and discover what works well or best on the data. The problem is, each algorithm has specific requirements or expectations with regard to the data.

For example, some algorithms assume each input variable, and perhaps the target variable, to have a specific probability distribution. This is often the case for linear machine learning models that expect each numeric input variable to have a Gaussian probability distribution. This means that if you have input variables that are not Gaussian or nearly Gaussian, you might need to change them so that they are Gaussian or more Gaussian. Alternatively, it may encourage you to reconfigure the algorithm to have a different expectation on the data.

Some algorithms are known to perform worse if there are input variables that are irrelevant or redundant to the target variable. There are also algorithms that are negatively impacted if two or more input variables are highly correlated. In these cases, irrelevant or highly correlated variables may need to be identified and removed, or alternate algorithms may need to be used. There are also algorithms that have very few requirements about the probability distribution of input variables or the presence of redundancies, but in turn, may require many more examples (rows) in order to learn how to make good predictions.

As such, there is an interplay between the data and the choice of algorithms. Primarily, the algorithms impose expectations on the data, and adherence to these expectations requires the data to be appropriately prepared. Conversely, the form of the data may provide insight into those algorithms that are more likely to be effective.

<a id="depend"></a>
<h3><ins>Model Performance Depends on Data</ins></h3>

The performance of a machine learning algorithm is only as good as the data used to train it. This is often summarized as **garbage in, garbage out**, which basically implies a weak representation of the problem that insufficiently captures the dynamics required to learn how to map examples of inputs to outputs. A dataset may be a weak representation of the problem we are trying to solve for many reasons, although there are two main classes of reason.
- **Complex Data:** It may be because complex nonlinear relationships are compressed in the raw data that can be unpacked and exposed using data preparation techniques.
- **Messy Data:** It may also be because the data is not perfect, ranging from mild random fluctuations in the observations, referred to as statistical noise, to missing values or errors that result in out-of-range values and conflicting data.

>**Given that machine learning algorithms are routine for the most part, the one thing that changes from project to project is the specific data used in the modeling.**

<a id="tech"></a>
<h1 align="center">DATA PREPARATION TECHNIQUES</h1>

<a id="tasks"></a>
<h3><ins>Common Tasks of Data Preparation</ins></h3>

There are common or standard tasks that can be used or explored during the data
preparation step in a machine learning project. These tasks include:
- **Data Cleaning:** Identifying and correcting mistakes or errors in the data.
- **Feature Selection:** Identifying those input variables that are most relevant to the task.
- **Data Transforms:** Changing the scale or distribution of variables.
- **Feature Engineering:** Deriving new variables from available data. The process of creating representations of data that increase the effectiveness of a model.
- **Dimensionality Reduction:** Creating compact projections of the data.

<a id="clean"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Data Cleaning</b></p>
Data cleaning involves fixing systematic problems or errors in messy data. The most useful data cleaning involves deep domain expertise and could involve identifying and addressing specific observations that may be incorrect. There are many reasons data may have incorrect values, such as being mistyped, corrupted, duplicated, and so on. Domain expertise may allow obviously erroneous observations to be identified as they are different from what is expected.

Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed. This might involve removing a row or a column. Alternately, it might involve replacing observations with new values. As such, there are general data cleaning operations that can be performed, such as:
- Using statistics to define normal data and identify outliers.
- Identifying columns that have the same value or no variance and removing them.
- Identifying duplicate rows of data and removing them.
- Marking empty values as missing.
- Imputing missing values using statistics or a learned model.

Data cleaning is typically performed first, prior to other data preparation operations.
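
A minimal pandas sketch of these general cleaning operations follows. The small table, its column names, and the convention that `-1` encodes an unknown age are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, a constant column,
# a sentinel value (-1) standing in for "unknown", and a missing value
raw = pd.DataFrame({
    "age": [25, 32, 32, -1, 47, 51],
    "income": [40000.0, 52000.0, 52000.0, 61000.0, np.nan, 58000.0],
    "country": ["US", "US", "US", "US", "US", "US"],
})

# Identify duplicate rows and remove them
clean = raw.drop_duplicates().copy()

# Identify columns with a single unique value (no variance) and remove them
single_valued = [col for col in clean.columns if clean[col].nunique() <= 1]
clean = clean.drop(columns=single_valued)

# Mark the sentinel value as missing (assumption: -1 means unknown)
clean["age"] = clean["age"].replace(-1, np.nan)

# Impute missing values with a simple statistic, here the column median
clean = clean.fillna(clean.median())

print(clean)
```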

<img src="./images/Data Preparation/data_cleaning.png" width=400 height=400>

<a id="feat"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Feature Selection</b></p>
Feature selection refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted. This is important as irrelevant and redundant input variables can distract or mislead learning algorithms possibly resulting in lower predictive performance. Additionally, it is desirable to develop models only using the data that is required to make a prediction, e.g. to favor the simplest possible well performing model.

Feature selection techniques may generally be grouped into those that use the target variable (supervised) and those that do not (unsupervised). Additionally, the supervised techniques can be further divided into models that automatically select features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best performing model (wrapper) and those that score each input feature and allow a subset to be selected (filter).

Statistical methods, such as correlation, are popular for scoring input features. The features can then be ranked by their scores and a subset with the largest scores used as inputs to a model. The choice of statistical measure depends on the data types of the input variables. Additionally, there are different common feature selection use cases we may encounter in a predictive modeling project, such as:
- Categorical inputs for a classification target variable.
- Numerical inputs for a classification target variable.
- Numerical inputs for a regression target variable.

When a mixture of input variable data types is present, different filter methods can be used. Alternately, a wrapper method such as the popular Recursive Feature Elimination (RFE) method can be used that is agnostic to the input variable type. The broader field of scoring the relative importance of input features is referred to as feature importance, and many model-based techniques exist whose outputs can be used to aid in interpreting the model, interpreting the dataset, or in selecting features for modeling.
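
To make the filter and wrapper ideas concrete, here is a small sketch using scikit-learn. The synthetic data, the ANOVA F-statistic (`f_classif`) as the scoring function, and logistic regression as the wrapped model are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 numerical inputs, only 3 of which are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# Filter method: score each feature against the target, keep the top 3
selector = SelectKBest(score_func=f_classif, k=3)
X_filter = selector.fit_transform(X, y)

# Wrapper method: recursively drop the weakest feature according to a model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_wrapper = rfe.fit_transform(X, y)

print("filter keeps columns: ", selector.get_support(indices=True))
print("wrapper keeps columns:", rfe.get_support(indices=True))
```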

<img src="./images/Data Preparation/Feature_Selection.png" width=400 height=400>

<a id="trans"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Data Transforms</b></p>

Data transforms are used to change the type or distribution of data variables. This is a large umbrella of different techniques and they may be just as easily applied to input and output variables. Recall that data may have one of a few types, such as numeric or categorical, with subtypes for each, such as integer and real-valued floating point values for numeric, and nominal, ordinal, and boolean for categorical.
- **Numeric Data Type:** Number values.
    * Integer: Integers with no fractional part.
    * Float: Floating point values.

- **Categorical Data Type:** Label values.
    * Ordinal: Labels with a rank ordering.
    * Nominal: Labels with no rank ordering.
    * Boolean: Values True and False.

We may wish to convert a numeric variable to an ordinal variable in a process called discretization. Alternatively, we may encode a categorical variable as integers or boolean variables, required on most classification tasks.
- **Discretization Transform:** Encode a numeric variable as an ordinal variable.
- **Ordinal Transform:** Encode a categorical variable into an integer variable.
- **One Hot Transform:** Encode a categorical variable into binary variables.
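
The three transforms listed above might look like this in scikit-learn. The toy arrays are made up, and note that `OrdinalEncoder` assigns integers in sorted category order by default, which only makes sense when that order is meaningful or explicitly configured.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

# Discretization: numeric ages -> 3 equal-width ordinal bins
ages = np.array([[18.0], [25.0], [37.0], [52.0], [70.0], [81.0]])
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = discretizer.fit_transform(ages)
print(age_bins.ravel())  # [0. 0. 0. 1. 2. 2.]

# Ordinal transform: each label -> one integer (sorted: blue=0, green=1, red=2)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
ordinal = OrdinalEncoder().fit_transform(colors)
print(ordinal.ravel())  # [2. 1. 0. 1.]

# One hot transform: each label -> one binary column per category
one_hot = OneHotEncoder().fit_transform(colors).toarray()
print(one_hot)
```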

For real-valued numeric variables, the way they are represented in a computer means there is dramatically more resolution in the range 0-1 than in the broader range of the data type. As such, it may be desirable to scale variables to this range, called normalization. If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one.
- **Normalization Transform:** Scale a variable to the range 0 to 1.
- **Standardization Transform:** Scale a variable to a standard Gaussian.
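
A short sketch of both scaling transforms, assuming scikit-learn's `MinMaxScaler` and `StandardScaler` and a made-up two-column array with very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two variables on very different scales (illustrative values)
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0],
                 [4.0, 400.0],
                 [5.0, 500.0]])

# Normalization: rescale each column to the range 0 to 1
normalized = MinMaxScaler().fit_transform(data)

# Standardization: shift each column to mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(data)

print(normalized)
print(standardized.mean(axis=0), standardized.std(axis=0))
```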

The probability distribution for numerical variables can be changed. For example, if the distribution is nearly Gaussian, but is skewed or shifted, it can be made more Gaussian using a power transform. Alternatively, quantile transforms can be used to force a probability distribution, such as a uniform or Gaussian, on a variable with an unusual natural distribution.
- **Power Transform:** Change the distribution of a variable to be more Gaussian.
- **Quantile Transform:** Impose a probability distribution such as uniform or Gaussian.
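
The following sketch applies both transforms to an artificially skewed variable. The exponential sample, the Yeo-Johnson method, and the number of quantiles are assumptions picked for the example.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# A strongly right-skewed variable (illustrative only)
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=(1000, 1))

# Power transform: make the distribution more Gaussian
power = PowerTransformer(method="yeo-johnson")
more_gaussian = power.fit_transform(skewed)

# Quantile transform: impose a uniform distribution on the variable
quantile = QuantileTransformer(n_quantiles=100, output_distribution="uniform",
                               random_state=1)
uniform = quantile.fit_transform(skewed)

print("skewness before:", skew(skewed.ravel()))
print("skewness after power transform:", skew(more_gaussian.ravel()))
print("range after quantile transform:", uniform.min(), uniform.max())
```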

An important consideration with data transforms is that the operations are generally performed separately for each variable. As such, we may want to perform different operations on different variable types. We may also want to use the transform on new data in the future. This can be achieved by saving the transform objects to file along with the final model trained on all available data.
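
As one possible way to apply different operations to different variable types and reuse them later, the sketch below combines a `ColumnTransformer` with `pickle`. The column names and values are hypothetical, and the transformer is serialized to bytes in memory; writing those bytes to a file with `pickle.dump` would work analogously.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one numeric and one categorical column
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "color": ["red", "blue", "red", "green"],
})

# Different transforms for different variable types;
# sparse_threshold=0 forces a dense array as output
transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["height_cm"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ],
    sparse_threshold=0,
)
prepared = transformer.fit_transform(df)

# Serialize the fitted transformer so it can be reused on future data
saved = pickle.dumps(transformer)
restored = pickle.loads(saved)

# New data is transformed with the statistics learned from the training data
new_data = pd.DataFrame({"height_cm": [165.0], "color": ["blue"]})
new_prepared = restored.transform(new_data)
print(new_prepared)
```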

<img src="./images/Data Preparation/Transforms.png" width=450 height=450>

<a id="eng"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Feature Engineering</b></p>
Feature engineering refers to the process of creating new input variables from the available data. Engineering new features is highly specific to the data and data types. As such, it often requires the collaboration of a subject matter expert to help identify new features that could be constructed from the data. This specialization makes the topic hard to generalize. Nevertheless, there are some techniques that can be reused, such as:
<ul><li>Adding a boolean flag variable for some state.</li>
<li>Adding a group or global summary statistic, such as a mean.</li>
<li>Adding new variables for each component of a compound variable, such as a date-time.</li></ul>
A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplying them with other input variables. These are referred to as **polynomial features**.
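
The reusable techniques above can be sketched with pandas and scikit-learn. The `orders` table, its column names, and the threshold of 12.0 for the flag are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical order data
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b"],
    "amount": [10.0, 30.0, 5.0, 15.0],
    "placed_at": pd.to_datetime(["2021-01-04 09:30", "2021-01-09 18:00",
                                 "2021-02-14 12:15", "2021-03-01 23:45"]),
})

# Boolean flag variable for some state
orders["is_large"] = orders["amount"] > 12.0

# Group summary statistic: mean order amount per customer
orders["customer_mean_amount"] = orders.groupby("customer")["amount"].transform("mean")

# Components of a compound date-time variable
orders["month"] = orders["placed_at"].dt.month
orders["hour"] = orders["placed_at"].dt.hour
orders["is_weekend"] = orders["placed_at"].dt.dayofweek >= 5

# Polynomial features of degree 2 for inputs x1=2, x2=3:
# columns are x1, x2, x1^2, x1*x2, x2^2
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(poly)  # [[2. 3. 4. 6. 9.]]
```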

>The theme of feature engineering is to add broader context to a single observation or decompose a complex variable, both in an effort to provide a more straightforward perspective on the input data.

<a id="dim"></a>
<p style="padding-left:2em; text-decoration:underline"><b>Dimensionality Reduction</b></p>
The number of input features for a dataset may be considered the dimensionality of the data. For example, two input variables together can define a two-dimensional area where each row of data defines a point in that space. This idea can then be scaled to any number of input variables to create large multi-dimensional hyper-volumes. The problem is, the more dimensions this space has (e.g. the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is referred to as **the curse of dimensionality**.

This motivates feature selection, although an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data. This is referred to generally as **dimensionality reduction**. Unlike feature selection, the variables in the projected data are not directly related to the original input variables, making the projection difficult to interpret. The most common approach to dimensionality reduction is to use a matrix factorization technique:
<ul><li>Principal Component Analysis</li>
<li>Singular Value Decomposition</li></ul>
The main impact of these techniques is that they remove linear dependencies between input variables, e.g. correlated variables. Other approaches exist that learn a lower-dimensional representation of the data. We might refer to these as model-based methods, such as linear discriminant analysis and perhaps autoencoders.
<ul><li>Linear Discriminant Analysis</li></ul>
Sometimes manifold learning algorithms can also be used, such as Kohonen self-organizing maps (SOM) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
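
A brief sketch of the matrix factorization and model-based projections, using the iris dataset bundled with scikit-learn. The choice of dataset and of two output dimensions is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Four input features, three classes
X, y = load_iris(return_X_y=True)

# Principal Component Analysis: unsupervised projection to 2 dimensions
X_pca = PCA(n_components=2).fit_transform(X)

# Singular Value Decomposition (truncated): also unsupervised
X_svd = TruncatedSVD(n_components=2, random_state=1).fit_transform(X)

# Linear Discriminant Analysis: supervised, uses the class labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# The projected PCA components are linearly uncorrelated
pca_correlation = np.corrcoef(X_pca[:, 0], X_pca[:, 1])[0, 1]
print(X_pca.shape, X_svd.shape, X_lda.shape, round(pca_correlation, 6))
```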

<img src="./images/Data Preparation/Dimensionality.png" width=300 height=300>

<a id="load"></a>
<h1 align="center">LOADING DATA</h1>

The first step in any machine learning endeavor is to get the raw data.

<a id="sample"></a>
<h5 style="text-decoration:underline">Scikit-Learn Sample Datasets</h5>

Often we do not want to go through the work of loading, transforming, and cleaning a real-world dataset before we can explore some machine learning algorithm or method. Luckily, scikit-learn comes with some common datasets we can quickly load. These datasets are often called “toy” datasets because they are far smaller and cleaner than a dataset we would see in the real world.

In [3]:
# Load scikit-learn's datasets
from sklearn import datasets

print(dir(datasets))

['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups', 'clear_data_home', 'dump_svmlight_file', 'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions', 'get_data_home', 'load_boston', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_friedman1', 'make_friedman2', 'make

### Example Setup
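```python
# Load scikit-learn's datasets module
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# List the attributes available on the loaded dataset object
print(dir(digits))
# ['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']
```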

<a id="create"></a>
<h5 style="text-decoration:underline">Creating a Simulated Dataset</h5>

Scikit-learn offers many methods for creating simulated data. Of those, three methods are particularly useful.
- `make_regression` returns a feature matrix of float values and a target vector of float values, while `make_classification` and `make_blobs` return a feature matrix of float values and a target vector of integers representing membership in a class.
    - It is worth reviewing scikit-learn’s documentation for a full description of all the parameters.
        - In [make_regression](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression) and [make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification), `n_informative` determines the number of features that are used to generate the target vector. If `n_informative` is less than the total number of features (`n_features`), the resulting dataset will have redundant features that can be identified through **feature selection techniques**.
        - In addition, `make_classification` accepts a `weights` parameter that allows us to simulate datasets with imbalanced classes. For example, `weights = [.25, .75]` would return a dataset with roughly 25% of observations belonging to one class and 75% of observations belonging to a second class.
        - For [make_blobs](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs), the `centers` parameter determines the number of clusters generated. Using the matplotlib visualization library, we can visualize the clusters generated by `make_blobs`.

```python
# Load libraries
from sklearn.datasets import make_regression, make_classification, make_blobs

# Regression
# ----------
# Generate features matrix, target vector, and the true coefficients
features, target, coefficients = make_regression(n_samples = 100, n_features = 3,
                                                 n_informative = 3, n_targets = 1,
                                                 noise = 0.0, coef = True, random_state = 1)

# Classification
# --------------
# Generate features matrix and target vector
features, target = make_classification(n_samples = 100, n_features = 3, n_informative = 3,
                                       n_redundant = 0, n_classes = 2, weights = [.25, .75],
                                       random_state = 1)

# Clustering
# ----------
# Generate feature matrix and target vector
features, target = make_blobs(n_samples = 100, n_features = 2, centers = 3,
                              cluster_std = 0.5, shuffle = True, random_state = 1)
```
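
As a quick sanity check on the `weights` and `centers` parameters, the sketch below counts the generated labels. The larger sample size of 1000 is an assumption made so that the class proportions are easier to see, and the proportions are approximate because `make_classification` randomly flips a small fraction of labels by default.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_classification

# Imbalanced classification data: roughly 25% class 0 and 75% class 1
features, target = make_classification(n_samples=1000, n_features=3, n_informative=3,
                                       n_redundant=0, n_classes=2,
                                       weights=[.25, .75], random_state=1)
class_share = np.bincount(target) / len(target)
print("share of each class:", class_share)

# Clustered data: the number of distinct labels equals the number of centers
blob_features, blob_target = make_blobs(n_samples=100, n_features=2, centers=3,
                                        cluster_std=0.5, random_state=1)
n_clusters = len(np.unique(blob_target))
print("number of clusters:", n_clusters)
```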

<a id="files"></a>
<h5 style="text-decoration:underline">Loading Files</h5>

If the data file is a `.csv` file then:<br>
```python
df = pd.read_csv(file_path,...)
```
Otherwise, replace `read_csv` with one of the following pandas readers:

In [4]:
# Using regex to search for pandas methods for reading in different file types
files_to_read = []
for method in dir(pd):
    re_search = re.search(r"(read_\w{1,})",method)
    if re_search and method!='read_csv':
        files_to_read.append(re_search.group())
        
', '.join(files_to_read)

'read_clipboard, read_excel, read_feather, read_fwf, read_gbq, read_hdf, read_html, read_json, read_orc, read_parquet, read_pickle, read_sas, read_spss, read_sql, read_sql_query, read_sql_table, read_stata, read_table'

In [5]:
# As a list comprehension
print([re.search(r"(read_\w{1,})",method).group() for method in dir(pd)
 if re.search(r"(read_\w{1,})",method) is not None and method!='read_csv'])

['read_clipboard', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table']


There are two things to note about loading CSV files. First, it is often useful to take a quick look at the contents of the file before loading. It can be very helpful to see how a dataset is structured beforehand and what parameters we need to set to load in the file. Second, read_csv has over 30 parameters and therefore the documentation can be daunting. Fortunately, those parameters are mostly there to allow it to handle a wide variety of CSV formats. 
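
To illustrate how a few `read_csv` parameters adapt to a particular CSV layout, here is a small sketch that reads from an in-memory string instead of a file. The file contents, the semicolon separator, the decimal comma, and the `?` marker for missing values are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content: semicolon-separated, decimal commas, "?" for missing
csv_text = "id;name;score\n1;alice;3,5\n2;bob;?\n3;carol;4,0\n"

# Taking a quick look at the raw text first shows which parameters are needed
print(csv_text)

df = pd.read_csv(
    io.StringIO(csv_text),  # a file path would normally go here
    sep=";",                # column separator used in this file
    decimal=",",            # decimal mark used in this file
    na_values=["?"],        # extra marker to treat as missing
    index_col="id",         # use the id column as the row index
)
print(df)
```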

<a id="sql"></a>
<h5 style="text-decoration:underline">Querying a SQL Database</h5>

### Example Setup
```python
import pandas as pd
from sqlalchemy import create_engine

# Create a connection to the database
# (assumes a SQLite file named sample.db that contains a table called data)
database_connection = create_engine('sqlite:///sample.db')

# Load data
dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)
```
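
For a version that runs without an existing database file, the sketch below builds a throwaway in-memory SQLite database with Python's built-in `sqlite3` module and queries it with `read_sql_query`. The table name `data` and its columns are hypothetical.

```python
import sqlite3

import pandas as pd

# Create an in-memory SQLite database with a small example table
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE data (id INTEGER, value REAL)")
connection.executemany("INSERT INTO data VALUES (?, ?)",
                       [(1, 0.5), (2, 1.5), (3, 2.5)])
connection.commit()

# Load only the rows matching a parameterized filter into a DataFrame
dataframe = pd.read_sql_query("SELECT * FROM data WHERE value > ?",
                              connection, params=(1.0,))
connection.close()

print(dataframe)
```

Passing the filter value through `params` rather than formatting it into the SQL string avoids quoting problems and SQL injection.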

<a id="ref"></a>
<h3><ins>REFERENCES</ins></h3>

<ins>BOOKS</ins>
- [Data Preparation for Machine Learning by Jason Brownlee](https://machinelearningmastery.com/data-preparation-for-machine-learning/)
- [Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning](https://gaurav320.github.io/vpspu.github.io/eb/pdf/ML.pdf)

