# Data Analytics

## Ingesting Data with Pandas

![Python and Pandas!](images/PythonPandasandDataIngestion.png)

## OBJECTIVES: 

- What is Pandas
- Pandas & NumPy
- Pandas and Jupyter Notebooks
- What Pandas can do
- Reading Data from: 
    - CSV files
    - Excel files
    - SQL databases
- Hands-on Data Wrangling
- In-Class Group Activity

## What is Pandas

- **Pandas** - An open-source Python package that is widely used
- Built on top of NumPy (which supports arrays of one or more dimensions)
- **Stands for either**: 
    1. Panel Data 
    2. Python Data Analysis
- Created by Wes McKinney in 2008
- **NOTE**: The curriculum includes two additional links on Pandas

### NOTES
>
> ## What is Pandas
> 
> - An open-source Python package that is most widely used for data science/data analysis and machine learning tasks. 
> - Built on top of NumPy which provides support for multi-dimensional arrays.
> - The name Pandas is derived from "Panel Data" and is also read as "Python Data Analysis"
> - Created by Wes McKinney in 2008
> - Official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide
> - Community tutorials: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

## Pandas & NumPy - Two of the most widely used libraries in data science
| **Pandas** | **NumPy** |
| ---- | ---- |
| A high-level data manipulation tool built on NumPy | Supports large multi-dimensional arrays and high-level mathematical functions |
| **DataFrame (df)** - Structured like a table or spreadsheet (rows and columns). Uses some NumPy functions. | |
| Core object is the Series | Core object is the ndarray |
| Generally uses more memory and is slower | Generally uses less memory and is faster |
| Mainly works with tabular data | Works with numerical data |

### NOTES
> 
> ## Pandas & NumPy
> 
> - NumPy is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
> - Pandas is a high-level data manipulation tool that is built on the NumPy package
> - Pandas offers an in-memory 2d table object called a DataFrame
> - A DataFrame is structured like a table or spreadsheet -- with rows and columns
> - There are a few functions that exist in NumPy that we use specifically on Pandas DataFrames
> - Just as the "ndarray" is the foundation of NumPy, the "Series" is the core object of Pandas
> - NumPy generally consumes less memory than Pandas and is faster for purely numerical work
> - These two libraries are among the most widely used for data science applications
> - Pandas mainly works with tabular data, whereas NumPy works with numerical data
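
The relationship described in these notes can be seen in a few lines. This is a minimal sketch with made-up numbers; the variable names (`arr`, `s`, `df`) are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: homogeneous numbers, accessed by position.
arr = np.array([10.0, 20.0, 30.0])

# A pandas Series holds the same kind of data and adds an index of labels.
s = pd.Series(arr, index=["a", "b", "c"])

# A DataFrame is a table whose columns are Series.
df = pd.DataFrame({"x": s, "y": s * 2})

print(type(df["x"]))        # each column is a pandas Series
print(type(s.to_numpy()))   # the underlying values are a NumPy ndarray
print(np.sqrt(df["x"]))     # NumPy functions can be applied to pandas objects
```

Note that `np.sqrt` returns a Series that keeps the `a`, `b`, `c` labels, which is one example of a NumPy function being used directly on pandas data.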


## Pandas & Jupyter Notebooks
- Benefits to using Pandas within Jupyter Notebooks:
    - A good environment for data exploration and modeling 
    - Ability to execute code in a particular cell, as opposed to running one large file (saves time)
    - Can easily visualize dataframes and plots 

### NOTES
> 
> ## Pandas & Jupyter Notebooks
> 
> Jupyter Notebooks offer a good environment for using pandas to do data exploration and modeling, but pandas can also be used in text editors just as easily.
> 
> Jupyter Notebooks give us the ability to execute code in a particular cell as opposed to running the entire file. This saves a lot of time when working with large datasets and complex transformations. 
> 
> Notebooks also provide an easy way to visualize pandas’ DataFrames and plots.

## What can Pandas do?
- **Perform 5 data analysis steps**:
    1. load
    2. manipulate
    3. prepare
    4. model
    5. analyze
- It takes data sources (e.g., CSV or TSV files, SQL databases) and creates a DataFrame (with rows and columns)
- Widely regarded by data scientists as one of the best Python data analysis and manipulation tools
- **Pandas can do**:

|    |    |
|----|----|
| Data Cleansing | Data fill |
| Data normalization | Merges and joins |
| Data visualization | Statistical analysis |
| Data inspection | Loading and saving data |

### NOTES

## What can Pandas do?

Pandas can perform five significant steps required for processing and analyzing data, irrespective of the origin of the data: load, manipulate, prepare, model, and analyze.

What’s cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a DataFrame, which looks very similar to a table in statistical software or a spreadsheet (think Excel).

Because it covers all of these steps, many data scientists regard Pandas as one of the best Python data analysis and manipulation tools available.
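
A few of the capabilities listed above can be sketched on a tiny example. The `sales` and `stores` tables below are hypothetical data invented for this illustration, not files from the curriculum.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
sales = pd.DataFrame({
    "store_id": [1, 2, 3, 3],
    "revenue": [100.0, None, 250.0, 250.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "city": ["Austin", "Boston", "Chicago"],
})

# Data inspection: count missing values per column.
missing = sales.isna().sum()
print(missing)

# Data cleansing: drop exact duplicate rows.
cleaned = sales.drop_duplicates()

# Data fill: replace the missing revenue with 0.
cleaned = cleaned.fillna({"revenue": 0.0})

# Merges and joins: attach the city name to each sale.
merged = cleaned.merge(stores, on="store_id", how="left")
print(merged)

# Statistical analysis: average revenue across the remaining rows.
print("Mean revenue:", merged["revenue"].mean())
```

Whether filling a missing value with 0 is appropriate depends on the dataset; it is used here only to show the mechanics of `fillna`.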

---

## Installing and Using Pandas
- Pandas must be installed before use; NumPy is required and is installed automatically as a dependency:
    - **Windows**: `pip install pandas`
    - **Mac**: `pip3 install pandas` or `python3 -m pip install pandas`
- Once installed, import the library in every script or notebook that uses it:
    - **Syntax Example**: `import pandas as pd`
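
One quick way to confirm the installation worked is to import both libraries and print their versions. This is only a sanity check; the version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# If both imports succeed, the installation worked.
print("pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
```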

### Reading data from `CSV files` into a `DataFrame`:
- `pd.read_csv()` - Retrieves CSV file data to a dataframe
- Data is usually separated by commas (default). 
    - Other separators include: 
        - semi-colon (';')
        - colon (':')
        - vertical bar ('|')
        - tab ('\t')
    - To change separator, use `sep='<delimiter>'`
        - Example: `df = pd.read_csv(file_path, sep='|')`
- A CSV file can be opened in Notepad, but the columns will not line up; VS Code is a better choice
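
The effect of the `sep` parameter can be seen without any file on disk. The sketch below uses `io.StringIO` to stand in for a file, and the pipe-delimited text is hypothetical data made up for the example.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text standing in for a file on disk.
raw = "name|score\nAda|90\nGrace|85\n"

# Without sep, pandas looks for commas and finds a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep='|', the two columns are split correctly.
right = pd.read_csv(io.StringIO(raw), sep="|")

print(wrong.columns.tolist())   # ['name|score']
print(right.columns.tolist())   # ['name', 'score']
```

If a freshly loaded DataFrame has only one oddly named column, a wrong separator is a likely cause.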

### NOTES
>
> ### Reading data from `CSV files` into a `DataFrame`:
> Read all about the [Syntax and use of `.read_csv()`.](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
>
> A simple way to store big data sets is to use CSV (comma-separated values) files.
> 
> CSV files contain plain text in a well-known format that many tools, including Pandas, can read.
> 
> You can open a CSV file in Notepad, but the format will be off; use VS Code instead.
> 
> To access data from a CSV file, we use the `pd.read_csv()` function, which returns the data as a DataFrame.
> 
> By default, `pd.read_csv()` expects the values to be separated by commas, but it can also read files that use other separators such as `;`, `|`, or `:`.
> 
> To load a file with one of these separators, pass it through the `sep=` parameter. Example:
> ```python
>     f = pd.read_csv("datafile2.csv", sep='|')
> ```
> 
> For our example, we'll use a file from the resources folder in the curriculum. The filepath to the CSV file is `./resources/GREENCOMPUTERS500.csv`.
> 
> Let's first look at the data by opening the raw CSV in VS Code -> [Top 500 Green Computers](resources/GREENCOMPUTERS500.csv)
> 
> From this data we can see that we have a file with many columns.
> 
> Let's see what it looks like when we import the data into a DataFrame ...

In [None]:
# FOLLOW ALONG: importing a CSV file
import pandas as pd

In [None]:
### NOTES: importing a CSV file
# In curriculum, use file: 'GREENCOMPUTERS500.csv'
# First view dataset in curriculum (raw file)
# index_col -> column(s) to use as the row labels of the DataFrame. In this case,
# column 0 of the CSV (Rank) will be used as the index label for our rows.
green = pd.read_csv('./resources/GREENCOMPUTERS500.csv', index_col=0)
green.info()
green
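
To see what `index_col=0` changes, compare the same data loaded with and without it. The three-column text below is a hypothetical miniature stand-in; the real `GREENCOMPUTERS500.csv` has many more columns, and only the `Rank` column name is taken from the notes above.

```python
import io
import pandas as pd

# Hypothetical miniature data; only the "Rank" column name comes from the lesson.
raw = "Rank,Name,Country\n1,Alpha,Japan\n2,Beta,Germany\n"

default_index = pd.read_csv(io.StringIO(raw))
rank_index = pd.read_csv(io.StringIO(raw), index_col=0)

print(default_index.index.tolist())  # [0, 1]  (automatic 0-based row labels)
print(rank_index.index.tolist())     # [1, 2]  (values of the Rank column)

# With Rank as the index, rows can be looked up by rank.
print(rank_index.loc[2, "Name"])     # Beta
```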

## Pandas Data Wrangling with a CSV file

Next, we will use the 'data.csv' file under the resources folder in the curriculum

### NOTES
>
> ## Pandas Data Wrangling with a CSV file
>
> We will reuse the data file we introduced in the 1st Pandas session. For our example, we will use a file from the resources folder in the curriculum. 
>
> The filepath to the CSV file is `./resources/data.csv`

In [None]:
# FOLLOW ALONG: read and print a summary of a DataFrame
# Same print commands from Section 5.2
import pandas as pd
df = pd.read_csv("./resources/data.csv")

In [None]:
### NOTES: read and print a summary of a DF

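# Print the DataFrame (first and last 5 rows by default when it is long)
# A bare `df` only displays when it is the last line of a notebook cell
print(df)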

# Print first 10 rows
print(df.head(10))

# Print last 12 rows
print(df.tail(12))

# Print summary of number of columns, column labels, data types, memory usage, range index, and non-null values
df.info()

### A closer look at the DataFrame Info …

![DataFrame Info Display](images/Pandas_DF_InfoDisplay.png)

### Gathering Summary Statistics

In [None]:
df.describe()
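
As a rough sketch of what `describe()` returns, the example below runs it on a small hypothetical column (the same seven numbers used later in this lesson) and reads two values out of the result. `df_demo` is a made-up name so that it does not overwrite the lesson's `df`.

```python
import pandas as pd

# Hypothetical single-column DataFrame for illustration.
df_demo = pd.DataFrame({"value": [1, 3, 3, 6, 7, 8, 9]})

# describe() returns a DataFrame of summary statistics for numeric columns:
# count, mean, std, min, 25%, 50%, 75%, max.
summary = df_demo.describe()
print(summary)

# Individual statistics can be read from the summary by row label.
print(summary.loc["mean", "value"])
print(summary.loc["50%", "value"])   # the 50th percentile is the median
```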

### Understanding Mean, Median, and Range

In data analysis, summarizing datasets with basic statistics can provide valuable insights. Three fundamental statistics are mean, median, and range. Let's explore what each of these terms means and why they're important.

#### Mean (Average)

The **mean** is what most people commonly refer to as the "average." It's calculated by adding up all the numbers in a set and then dividing by the count of those numbers.

For example, the mean of 2, 3, and 10 is `(2 + 3 + 10) / 3 = 5`.

The mean provides a central value for the dataset but can be affected by outliers (extremely high or low values).

#### Median (Middle Value)

The **median** is the middle value in a dataset when the numbers are arranged in ascending or descending order. If there is an even number of observations, the median is the average of the two middle numbers.

For instance, in the set 1, 3, 3, 6, 7, 8, 9, the median is 6. In the set 1, 2, 3, 4, the median is `(2 + 3) / 2 = 2.5`.

The median is useful because it is not skewed by outliers, making it a better measure of central tendency for skewed distributions.
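
The worked numbers above can be checked with pandas, and adding a single extreme value shows how differently the mean and the median react. The outlier value 1000 is an assumption chosen only to make the effect obvious.

```python
import pandas as pd

values = pd.Series([2, 3, 10])
print(values.mean())     # (2 + 3 + 10) / 3 = 5.0
print(values.median())   # middle of 2, 3, 10 = 3.0

# Add one extreme value (hypothetical outlier).
with_outlier = pd.Series([2, 3, 10, 1000])
print(with_outlier.mean())     # 253.75, pulled far upward by the outlier
print(with_outlier.median())   # (3 + 10) / 2 = 6.5, barely moved

# Even number of observations: average of the two middle values.
even = pd.Series([1, 2, 3, 4])
print(even.median())     # 2.5
```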

#### Range (Spread of Data)

The **range** indicates the spread of data by showing the difference between the highest and lowest values in the set.

To calculate the range of 1, 3, 3, 6, 7, 8, 9, you subtract 1 (the lowest number) from 9 (the highest number), giving a range of 8.

The range gives us a quick sense of the variability in the dataset, but it doesn't tell us how the values are distributed between the highest and lowest points.
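
A short sketch of both points: computing the range, and seeing that two datasets can share a range while being distributed very differently. The `clustered` and `spread` series are hypothetical, and the standard deviation is brought in only as one possible way to tell them apart.

```python
import pandas as pd

values = pd.Series([1, 3, 3, 6, 7, 8, 9])
value_range = values.max() - values.min()
print(value_range)   # 9 - 1 = 8

# Two hypothetical datasets with the same range but different distributions.
clustered = pd.Series([1, 1, 1, 1, 9])
spread = pd.Series([1, 3, 5, 7, 9])

clustered_range = clustered.max() - clustered.min()
spread_range = spread.max() - spread.min()
print(clustered_range, spread_range)   # both are 8

# The standard deviation differs, revealing what the range alone hides.
print(clustered.std(), spread.std())
```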

---

Pandas provides the `mean()`, `median()`, and `mode()` methods to calculate the respective values for a specified column. There is no dedicated range method; the range is computed as `max()` minus `min()`.

- **Mean** = the average value
- **Median** = the value in the middle, after you have sorted all the values in ascending order
- **Mode** = the value that appears most frequently

In [None]:

# Replace 'column_name' with the name of a numeric column in df
mean_value = df['column_name'].mean()
median_value = df['column_name'].median()
range_value = df['column_name'].max() - df['column_name'].min()

print(f"Mean: {mean_value}, Median: {median_value}, Range: {range_value}")
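
The cell above uses the placeholder `'column_name'`, so here is a self-contained variant that also covers `mode()`. The `score` column and `df_demo` are hypothetical names standing in for a real column of `data.csv`.

```python
import pandas as pd

# Hypothetical data; "score" stands in for 'column_name'.
df_demo = pd.DataFrame({"score": [1, 3, 3, 6, 7, 8, 9]})

mean_value = df_demo["score"].mean()
median_value = df_demo["score"].median()
range_value = df_demo["score"].max() - df_demo["score"].min()

# mode() returns a Series, because several values can tie for most frequent.
mode_values = df_demo["score"].mode()

print(f"Mean: {mean_value}, Median: {median_value}, Range: {range_value}")
print(f"Mode(s): {mode_values.tolist()}")   # [3]
```

Because `mode()` returns a Series rather than a single number, use `mode_values[0]` (or `.tolist()`) when a plain value is needed.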


### More practice with CSV files: Group Activity
- We will use the CSV file `resources/titanic.csv` from the resources folder.

### NOTES
>
> ### More practice with CSV files - Titanic
> 
> First, we need to gather our data.
> 
> We can either use the data from our resources directory, or we can import our data from the web.
> 
> The filepath to the CSV file in the curriculum resources folder is ["./resources/titanic.csv"](resources/titanic.csv).
> 
> Alternatively, the URL to the data file on the web is https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
> 
> You can download that file to your machine, or we can pull that file directly in our code.
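
Both options from the notes can be combined in one small helper. This is a sketch, not part of the curriculum: `load_csv` is a hypothetical function name, and it relies on the fact that `pd.read_csv()` accepts a URL string in place of a file path.

```python
from pathlib import Path

import pandas as pd

TITANIC_URL = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"


def load_csv(local_path, url):
    """Read the local file if it exists; otherwise let pandas fetch the URL."""
    if Path(local_path).exists():
        return pd.read_csv(local_path)
    # read_csv accepts a URL string in place of a file path.
    return pd.read_csv(url)


# Usage (needs either the local file or internet access):
# titanic = load_csv("./resources/titanic.csv", TITANIC_URL)
# titanic.head()
```

Preferring the local copy keeps the notebook usable offline and avoids downloading the file on every run.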
