# Introduction to Data Science

## What is data science?

Data science is an interdisciplinary field that combines math and statistics, computer science, and domain knowledge to extract knowledge and insights from data and solve real-world problems. This differs from what we consider traditional research, which does not normally incorporate computer science or programming.

There are several components to the **data science lifecycle**. 

1. Asking questions and defining problems that need to be solved.

2. Gathering data to address the question.

3. Fixing inconsistencies and missing values within the data.

4. Visualizing the data to identify patterns or trends.

5. Analyzing the patterns or trends to draw conclusions. 

Note that the data science lifecycle can be a never-ending cycle as we continue to take a progressively deeper dive into the data and formulate additional questions.  

## Data cleaning and data exploration

There are two key steps within the data science lifecycle: data cleaning and data exploration. These two steps often alternate, because exploring the data tends to reveal new issues that need cleaning.

**Data cleaning** is the process of transforming the raw data to allow for data analysis. This typically involves filling in missing data, extracting data from text, and standardizing data formats.
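
The cleaning tasks listed above can be sketched with Pandas as follows. This is a minimal illustration, not a general recipe: the column names and values are made up, and filling a missing value with the column mean is only one of several possible strategies.

```python
import pandas as pd

# Hypothetical raw data with a missing value, numbers embedded in text,
# and inconsistently formatted category labels.
raw = pd.DataFrame({
    "species": [" Adelie", "adelie ", "GENTOO"],
    "body_mass": ["3750 g", "3800 g", None],
})

clean = raw.copy()

# Standardize the text format: strip whitespace and use a consistent case.
clean["species"] = clean["species"].str.strip().str.title()

# Extract the numeric part of the text and convert it to a float.
clean["body_mass_g"] = (
    clean["body_mass"].str.extract(r"(\d+)", expand=False).astype(float)
)

# Fill the missing value with the column mean (an assumed strategy).
clean["body_mass_g"] = clean["body_mass_g"].fillna(clean["body_mass_g"].mean())

print(clean[["species", "body_mass_g"]])
```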

**Data exploration** is the process of understanding the data through visualizations and analysis. This involves generating high- and low-level analyses of the data, including visualizations, to help develop hypotheses and uncover any underlying issues within the data.

## Core data science libraries

There are different sets of Python libraries we can use for each step of the data science process.

**Pandas** is the main library often used in data cleaning and data exploration. The data structures and operations within Pandas allow for ease of data manipulation and analysis. 

**Matplotlib, Seaborn, and Plotly** are standard libraries used for data visualizations. Matplotlib provides the basic framework for static plots. Seaborn builds on Matplotlib and integrates more closely with Pandas. Plotly is used to generate more interactive visualizations.

**Scikit-learn and Statsmodels** are standard libraries used for data analysis, specifically machine learning and statistical analysis functions respectively. 

While not a data science specific library, the **NumPy** library is often used for mathematical data manipulation.

## Introduction to Pandas and DataFrames

Pandas is a standard Python library used for analysis of tabular data. You can think of the structure of tabular data as a matrix or a spreadsheet. 

There are two main types of data structures within Pandas: DataFrame and Series. A **DataFrame** is a two-dimensional labeled tabular data structure consisting of rows and columns. A **Series** is a one-dimensional labeled array data structure. This is typically a singular row or column.
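
As a rough illustration of the two data structures, the sketch below builds a small DataFrame and a Series by hand. The column names, labels, and values are invented for this example.

```python
import pandas as pd

# Hypothetical measurements; the values are made up for illustration.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "flipper_length_mm": [181.0, 217.0, 195.0],
    },
    index=["a", "b", "c"],
)

# A Series built directly from a list, with its own labels.
s = pd.Series([181.0, 217.0, 195.0], index=["a", "b", "c"], name="flipper_length_mm")

print(type(df).__name__, df.ndim)  # the DataFrame has two dimensions
print(type(s).__name__, s.ndim)    # the Series has one dimension
```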

![Data-Science-1.jpg](attachment:Data-Science-1.jpg)

When we look at a DataFrame, we can identify several distinct attributes. A **row** is a horizontal line of data that typically encodes an individual observation. A **column** is a vertical line of data that typically encodes a specific feature of each row or observation. The topmost horizontal line represents the **column labels**, while the leftmost vertical line represents the **row labels or index**. The **shape** of the DataFrame stores its dimensions as the number of rows and the number of columns.
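
These attributes can be inspected directly on a DataFrame. The sketch below uses a small made-up DataFrame with hypothetical row labels `p1` and `p2`.

```python
import pandas as pd

df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "body_mass_g": [3750.0, 5000.0]},
    index=["p1", "p2"],  # hypothetical row labels
)

print(df.columns.tolist())  # column labels
print(df.index.tolist())    # row labels (index)
print(df.shape)             # (number of rows, number of columns)
```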

![Data-Science-2.jpg](attachment:Data-Science-2.jpg)

When we select a singular row or column from a DataFrame, we get a Series, which is a labeled array of data. Notice that the structure and metadata differ between a singular column and a singular row selection. 

In a singular row selection, the labels of the Series are the column labels. In addition, the dtype (data type) is `object` because the data in the Series contains mixed data types (float and string), which commonly occurs for row selections. In a singular column selection, the labels of the Series are the row labels or index. In this case, the dtype is `float64` because all the data contained in the Series are floats.
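
The difference between the two selections can be checked with a short sketch. The DataFrame here is a made-up stand-in for the one in the figure, with one text column and one float column.

```python
import pandas as pd

df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "body_mass_g": [3750.0, 5000.0]},
    index=["p1", "p2"],  # hypothetical row labels
)

row = df.loc["p1"]       # single row: a Series labeled by the column labels
col = df["body_mass_g"]  # single column: a Series labeled by the index

print(row.index.tolist(), row.dtype)
print(col.index.tolist(), col.dtype)
```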

One useful feature of a DataFrame is that its structure and stored data can easily be changed, which simplifies the data cleaning process. Data within a DataFrame can be edited in much the same way as assigning a value to a variable. Many Python and NumPy functions can be applied to an entire row or column. DataFrames also have built-in methods that can modify the entire DataFrame. We will explore more of these functionalities later in this module.
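
A minimal sketch of these three kinds of changes is shown below. The data and the choice of `np.log` and `rename` are assumptions made for illustration; many other functions and methods work the same way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "body_mass_g": [3750.0, 5000.0]},
    index=["p1", "p2"],  # hypothetical row labels
)

# Edit a single value, much like assigning to a variable.
df.loc["p1", "body_mass_g"] = 3800.0

# Apply a NumPy function to an entire column and store the result as a new column.
df["log_body_mass"] = np.log(df["body_mass_g"])

# Use a DataFrame method that returns a modified copy of the whole DataFrame.
df = df.rename(columns={"body_mass_g": "body_mass"})

print(df)
```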

## Combining DataFrames

![Data-Science-3.jpg](attachment:Data-Science-3.jpg)

Sometimes we will need to work with multiple datasets or DataFrames. In that case, we need to combine DataFrames by using either the `join` or `merge` method. In Pandas, the `join` method is normally based on indexes, while the `merge` method is based on columns. With the right setup, both methods can produce equivalent DataFrames.
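
The sketch below shows one way the two methods can reach the same result. The DataFrames `left` and `right` and their shared `id` column are hypothetical. Since `join` matches on the index, the `id` column is moved into the index first.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "species": ["Adelie", "Gentoo", "Chinstrap"]})
right = pd.DataFrame({"id": [2, 3, 4], "island": ["Biscoe", "Dream", "Torgersen"]})

# merge: match rows on a shared column.
merged = left.merge(right, on="id", how="inner")

# join: match rows on the index, so move "id" into the index first.
joined = (
    left.set_index("id")
    .join(right.set_index("id"), how="inner")
    .reset_index()
)

print(merged)
print(joined)
```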

There are many ways to join (or merge) DataFrames, each of which produces different results. 

- **Left join/merge**: All rows from the left DataFrame are kept and combined with the matching rows from the right DataFrame. Non-matching rows from the right DataFrame are removed, and left rows without a match get missing values in the right-hand columns.

- **Right join/merge**: All rows from the right DataFrame are kept and combined with the matching rows from the left DataFrame. Non-matching rows from the left DataFrame are removed, and right rows without a match get missing values in the left-hand columns.

- **Inner join/merge**: Only rows that match in both the left and right DataFrames are kept. Non-matching rows from both DataFrames are removed.

- **Outer join/merge**: All rows from both the left and right DataFrames are kept, irrespective of whether a matching row exists in the other DataFrame.

- **Cross join/merge**: Every row from the left DataFrame is paired with every row from the right DataFrame.

Note that a cross join/merge (Cartesian product) can result in a DataFrame that is much larger than either input, because every row of the left DataFrame is paired with every row of the right DataFrame. In contrast, when the matching keys are unique in both DataFrames, a left, right, or inner join/merge produces at most as many rows as the larger input, and an outer join/merge (union) produces at most the combined number of rows. Duplicate keys can increase these counts, because every matching pair of rows produces a row in the result.

The order of operation is an important consideration when combining DataFrames. Generally, joining/merging the left DataFrame to the right one does not produce the same result as joining/merging the right DataFrame to the left one.
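
To make the differences concrete, the sketch below applies each merge type to two small made-up DataFrames that share a hypothetical `id` column with unique values, and prints the number of rows in each result.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "species": ["Adelie", "Gentoo", "Chinstrap"]})
right = pd.DataFrame({"id": [2, 3, 4], "island": ["Biscoe", "Dream", "Torgersen"]})

# Number of rows and surviving ids for each key-based merge type.
sizes = {}
for how in ["left", "right", "inner", "outer"]:
    result = left.merge(right, on="id", how=how)
    sizes[how] = len(result)
    print(how, len(result), sorted(result["id"].tolist()))

# A cross merge pairs every left row with every right row, so no key is given.
cross = left.merge(right, how="cross")
print("cross", len(cross))

# Swapping the operands changes which rows a left merge keeps.
left_to_right = left.merge(right, on="id", how="left")
right_to_left = right.merge(left, on="id", how="left")
print(left_to_right["id"].tolist(), right_to_left["id"].tolist())
```

With unique keys, only the cross merge exceeds the size of both inputs here, while the outer merge exceeds each input individually because it keeps the unmatched rows from both sides.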

## Visualizations

Visualizations are used to help depict relationships, patterns, or trends that cannot be easily described by numbers or text alone. 

![Data-Science-4.jpg](attachment:Data-Science-4.jpg)

There are many attributes that make up a plot as shown above, each of which can be manipulated to change the way we display the data in a graphical format. Notice that we can also divide a figure into multiple subplots.

![Data-Science-5.jpg](attachment:Data-Science-5.jpg)

In general, the type of variable used for visual comparison determines the type of plot we can use. For example:

- Top Left: Bar graphs are used to compare the distribution of qualitative variables (penguin species).

- Top Right: Histograms are used to compare the distribution of quantitative continuous variables (penguin flipper length).

- Bottom Left: Violin plots are used to compare the distribution of quantitative continuous variables (penguin body mass) across qualitative categories (penguin species).

- Bottom Right: Scatter plots are used to compare the relationship between quantitative continuous variables (penguin flipper length and bill length).
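
The four plot types above can be sketched with Matplotlib alone. The data below is synthetic and only loosely shaped like penguin measurements, so the figure it produces is not the one shown above. The `Agg` backend line is an assumption for running outside a notebook and can be dropped when working interactively.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data, not real penguin measurements.
species = ["Adelie", "Chinstrap", "Gentoo"]
counts = [50, 30, 40]
flipper = rng.normal(200, 14, size=120)
bill = 0.25 * flipper + rng.normal(0, 2, size=120)
mass_by_species = [rng.normal(m, 400, size=40) for m in (3700, 3730, 5080)]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Qualitative variable: bar graph of counts per category.
axes[0, 0].bar(species, counts)
axes[0, 0].set_title("Bar graph")

# Quantitative continuous variable: histogram.
axes[0, 1].hist(flipper, bins=15)
axes[0, 1].set_title("Histogram")

# Quantitative variable across qualitative categories: violin plot.
axes[1, 0].violinplot(mass_by_species)
axes[1, 0].set_xticks([1, 2, 3])
axes[1, 0].set_xticklabels(species)
axes[1, 0].set_title("Violin plot")

# Two quantitative continuous variables: scatter plot.
axes[1, 1].scatter(flipper, bill, s=10)
axes[1, 1].set_title("Scatter plot")

fig.tight_layout()
```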

Remember that the goal of making visualizations is to help us understand the data and communicate the results to others. The way plots are presented can vastly change the way we interpret the data and make conclusions. 

As a best practice, choose the plot type based on the variable types and style the plot to provide additional context. This includes:

- Axis scale and limits

- Color mapping and markings

- Avoiding overcrowding and cramming

- Transforming data to uncover underlying relationships
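
As one possible illustration of these styling choices, the sketch below plots synthetic power-law data on logarithmic axes, sets explicit axis limits, maps color to a made-up group label, and uses small semi-transparent markers to reduce overcrowding. All variable names and values are assumptions for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following a noisy power law, with a made-up group label.
x = rng.uniform(1, 100, size=200)
y = 3.0 * x**2 * rng.lognormal(0, 0.2, size=200)
group = rng.integers(0, 2, size=200)

fig, ax = plt.subplots()

# Color mapping and markings: color by group, small semi-transparent markers.
points = ax.scatter(x, y, c=group, cmap="viridis", s=12, alpha=0.6)

# Transforming data: log scales turn the power law into a straight line.
ax.set_xscale("log")
ax.set_yscale("log")

# Axis limits and labels.
ax.set_xlim(1, 100)
ax.set_xlabel("x (log scale)")
ax.set_ylabel("y (log scale)")
fig.colorbar(points, ax=ax, label="group")
```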

## Statistical analysis overview

The final part of the data science lifecycle is analyzing the data for patterns or trends to draw conclusions, which is typically done via statistical analysis. 

![Data-Science-6.jpg](attachment:Data-Science-6.jpg)

There are two basic types of statistical analysis: hypothesis testing and regression modeling. These two types of statistical analysis are used to address different types of questions, as follows:

- **Hypothesis testing**: Are the differences between the groups statistically significant?

- **Regression modeling**: Based on the existing data, what would be the prediction?

![Data-Science-7.jpg](attachment:Data-Science-7.jpg)

We can further classify the different types of hypothesis testing based on the variable types. If we are **comparing a numerical variable (mean)** of each group, we can use either a t-test or ANOVA, both of which require stratifying the dataset into groups. In contrast, if we are **comparing a categorical variable (proportion)** of each group, we can use either Fisher’s exact test or a chi-squared test, both of which require creating a contingency table.
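
Both kinds of test can be sketched with SciPy, which is an assumption of this example since the module does not name the library it uses for hypothesis testing. The group samples and the contingency table below are synthetic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Numerical variable: stratify into two groups and compare their means.
group_a = rng.normal(3700, 400, size=40)  # synthetic body masses, group A
group_b = rng.normal(5000, 400, size=40)  # synthetic body masses, group B
t_stat, p_mean = stats.ttest_ind(group_a, group_b)

# Categorical variable: build a contingency table and compare proportions.
# Rows are groups, columns are category counts (made-up numbers).
table = np.array([[30, 10],
                  [15, 25]])
chi2, p_prop, dof, expected = stats.chi2_contingency(table)

print(f"t-test p-value: {p_mean:.3g}")
print(f"chi-squared p-value: {p_prop:.3g} (degrees of freedom: {dof})")
```

For more than two groups, `stats.f_oneway` provides a one-way ANOVA, and `stats.fisher_exact` handles 2x2 tables with small counts.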

![Data-Science-8.jpg](attachment:Data-Science-8.jpg)

There are two basic types of regression models used for statistical analysis: linear regression and logistic regression. **Linear regression** is used to make numerical predictions, while **logistic regression** is used to predict categorical values (classification). Both regression models are fitted to existing labeled data.

## Regression modeling

There are many different types of regression models and machine learning algorithms, most of which follow the same modeling process with some variations.

1. Split the data into training and testing datasets.

2. Create the $X$ (observations) and $Y$ (responses) matrices for both datasets.

3. Fit the chosen model using the training dataset.

4. Evaluate the model using the testing dataset. 
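
The four steps above can be sketched with Scikit-learn as follows. The data is synthetic, the column names are hypothetical, and linear regression is chosen only as an example model. Other models follow the same pattern with a different estimator class.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: body mass depends linearly on flipper length, plus noise.
flipper = rng.uniform(170, 230, size=200)
data = pd.DataFrame({
    "flipper_length_mm": flipper,
    "body_mass_g": 50.0 * flipper - 5800.0 + rng.normal(0, 300, size=200),
})

# 1. Split the data into training and testing datasets.
train, test = train_test_split(data, test_size=0.25, random_state=0)

# 2. Create X (observations) and Y (responses) for both datasets.
X_train, Y_train = train[["flipper_length_mm"]], train["body_mass_g"]
X_test, Y_test = test[["flipper_length_mm"]], test["body_mass_g"]

# 3. Fit the chosen model using the training dataset.
model = LinearRegression().fit(X_train, Y_train)

# 4. Evaluate the model using the testing dataset (R^2 score).
r2 = model.score(X_test, Y_test)

print(f"slope: {model.coef_[0]:.2f}, R^2 on test data: {r2:.3f}")
```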

In general, regression models are some variation of $Y = X\theta$, where $X$ is the observation matrix, $Y$ is the response matrix, and $\theta$ is the model parameter matrix.
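
As a small numerical illustration of $Y = X\theta$, the sketch below generates synthetic data from a known, made-up $\theta$ and recovers it by ordinary least squares with NumPy. The leading column of ones in $X$ is an assumption that lets the first parameter act as an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# X: one row per observation, one column per feature, plus an intercept column.
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])

# Hypothetical true parameters used only to generate the synthetic responses.
theta_true = np.array([1.0, 2.0, -3.0])
Y = X @ theta_true + rng.normal(0, 0.1, size=n)

# Least-squares estimate of theta: minimizes the squared error of Y - X @ theta.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("X shape:", X.shape, "Y shape:", Y.shape)
print("estimated theta:", np.round(theta_hat, 2))
```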

As a quick mathematical review, matrices are essentially numerical tabular data where rows represent individual observations and columns represent features or characteristics. Below are some mathematical considerations with respect to the general regression model equation.

![Data-Science-9.jpg](attachment:Data-Science-9.jpg)