# Numeric data for Machine Learning

### Introduction

Being able to load your data is a fundamental first step in practically any machine learning project. The most common file format for such data is CSV (Comma-Separated Values). Python provides multiple ways to handle CSV files. Below are three standard approaches:

1. Loading CSV files using the Python standard library (`csv` module).
2. Loading CSV files using NumPy (via `numpy.loadtxt` or `numpy.genfromtxt`).
3. Loading CSV files using pandas (via `pandas.read_csv`).

Each approach has different strengths:
- **Python standard library**: No external dependencies, but you need to parse or convert types manually.
- **NumPy**: Good for large numeric datasets, but might require manual handling of headers or text.
- **pandas**: Very flexible, automatically handles many typical data-loading tasks, supports numerous parameters for parsing complex CSV files, and creates a powerful `DataFrame` object that simplifies subsequent data analysis.
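
The three approaches listed above can be compared side by side on a tiny in-memory CSV. This is only a minimal sketch: the two data rows are made up for illustration, and `io.StringIO` stands in for a real file so the example is self-contained.

```python
import csv
import io

import numpy as np
import pandas as pd

# A tiny, made-up CSV with no header row (two rows, three numeric columns).
csv_text = "6,148,33.6\n1,85,26.6\n"

# 1. Standard library: every field arrives as a string, so we convert manually.
rows_std = [
    [float(value) for value in row]
    for row in csv.reader(io.StringIO(csv_text))
]

# 2. NumPy: returns a 2D array of floats directly.
array_np = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# 3. pandas: returns a DataFrame and infers a data type for each column.
frame_pd = pd.read_csv(io.StringIO(csv_text), header=None)

print(rows_std)
print(array_np.shape)
print(frame_pd.dtypes)
```

Note how the standard library version needs an explicit `float()` conversion, while NumPy and pandas produce numeric values on their own.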

As a reference, you can learn a lot about CSV formatting conventions by reviewing the Request for Comments titled [Common Format and MIME Type for Comma-Separated Values (CSV) Files (RFC 4180)](https://tools.ietf.org/html/rfc4180).

Below are some points to keep in mind when working with CSV files:

**File header**:
- Does your CSV have a header row that labels each column? If yes, you can instruct your loading function to interpret the first row as the header. Otherwise, you may have to provide your own list of column names.

**Comments**:
- Some CSV files contain comment lines, typically starting with `#`. Depending on the method you use, you may need to specify that your file has comment lines or define which character is used to mark them.

**Delimiter**:
- The comma (`,`) is the default field delimiter for CSV files, but your data may use tabs (`\t`) or other delimiters (e.g., `;`). Make sure you specify the correct delimiter if it differs from the default.

**Quotes**:
- Fields in a CSV may include spaces or other special characters. Such fields are often enclosed in quotes (by default `"`). If your file uses a different quote character, you must specify it to ensure the data is parsed correctly.

Understanding these aspects of your CSV file helps avoid errors and ensures the data is loaded consistently.
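
As a rough illustration of the four points above, the sketch below parses a small, made-up CSV that has a comment line, a header row, a semicolon delimiter, and single-quoted fields. The file contents and column names are hypothetical. The parameters `sep`, `comment`, `quotechar`, and `header` belong to `pandas.read_csv`.

```python
import io

import pandas as pd

# Made-up file contents: one comment line, a header row, ';' as the delimiter,
# and single quotes around fields that themselves contain the delimiter.
raw = (
    "# exported from a hypothetical lab system\n"
    "name;score\n"
    "'Smith; John';4.5\n"
    "'Doe; Jane';3.0\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",        # delimiter differs from the default comma
    comment="#",    # lines starting with '#' are skipped
    quotechar="'",  # fields are quoted with ' instead of "
    header=0,       # first non-comment row holds the column names
)

print(df)
```

Without `quotechar="'"`, the semicolon inside each name would be treated as a field separator and the rows would no longer line up with the header.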

### Installing Python libraries
One great feature of Jupyter notebooks is that we can run terminal commands from a code cell using the `!` prefix. This means we can install Python libraries on the fly. If you plan to run these notebooks on your own machine, you'll need to install a few libraries as they are required. Below is an example that installs `pandas` and `numpy`.

In [None]:
!pip install --upgrade pip

!pip install pandas numpy
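
Once the installation has finished, one simple way to confirm that both libraries are importable is to print their versions. This is a minimal check, not part of the original notebook. The version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# If either import fails, the library is not installed in the active environment.
print("pandas", pd.__version__)
print("numpy", np.__version__)
```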

## Pima Indians dataset

Throughout this notebook, we will use the **Pima Indians dataset** to demonstrate how to load data into Python. This dataset describes medical records for Pima Indians, indicating whether or not each patient developed diabetes within five years.

The Pima Indian diabetes dataset is a renowned benchmark in the field of machine learning. It was originally made available through the *National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)* in the United States, and later hosted on the *UCI Machine Learning Repository*. Over time, it has become a standard reference dataset for illustrating and evaluating classification algorithms.

### Who was studied

- The dataset focuses on adult female patients (aged 21 or older) of Pima Indian heritage residing near Phoenix, Arizona.  
- This population is known to have a significantly higher incidence of type 2 diabetes.

### What was measured

Each entry in the dataset corresponds to one participant and includes **8 numerical attributes** capturing medical and demographic information:

1. **Number of times pregnant** (often referred to as “pregnancies”).  
2. **Plasma glucose concentration** (after a 2-hour oral glucose tolerance test).  
3. **Diastolic blood pressure** (mm Hg).  
4. **Triceps skin fold thickness** (mm).  
5. **2-hour serum insulin** (μU/ml).  
6. **Body mass index (BMI)**.  
7. **Diabetes pedigree function (DPF)** – an indication of diabetes likelihood based on familial and genetic risk.  
8. **Age** (years).

In addition, each record contains a **binary outcome** (“class”) indicating whether or not the participant developed **type 2 diabetes** within five years.

### Why it was collected

Researchers aimed to investigate risk factors for diabetes among a group with a particularly high risk of the disease. Various medical and demographic data (such as glucose tolerance tests, insulin measurements, and age) were gathered to determine which factors most strongly predicted the onset of diabetes.

### How the data was obtained

1. **Patient recruitment**  
   Eligible participants were female Pima Indians, aged 21 or older.  

2. **Measurements and testing**  
   Each participant underwent standard medical tests, including measuring plasma glucose concentration, blood pressure, and insulin levels, alongside providing demographic details such as age and number of pregnancies.  

3. **Five-year follow-up**  
   The pivotal outcome was whether each participant experienced the onset of type 2 diabetes within five years. This information was determined through follow-up medical records and diagnoses.  

4. **Compilation**  
   The anonymised data were assembled into a structured dataset of 768 entries, each representing a single participant’s measurements and diabetes outcome (onset or no onset).

The Pima Indian diabetes dataset remains a pivotal resource for demonstrating fundamental classification techniques and exploring how demographic and medical attributes can help predict the onset of type 2 diabetes.

## Download the dataset

In [None]:
import urllib.request

url = 'https://raw.githubusercontent.com/martyn-harris-bbk/AppliedMachineLearning/main/data/pima-indians-diabetes.data.csv'
filename = 'pima-indians-diabetes.data.csv'

urllib.request.urlretrieve(url, filename)
print("Download complete.")

### Load CSV from file

In the code cell below, we read a CSV file from the local system. We pass the file path in the `filename` variable and supply a list of column names through the `names=` parameter (here, the list stored in the variable `header`). If your CSV already contains a header row, omit `names=` and pass `header=0` (which is also what pandas does by default when no names are given) so that the first row of the file is used as the column labels.

The `.read_csv()` function returns a `pandas.DataFrame` object, which is a powerful 2D data structure that allows row and column operations, descriptive statistics, and data manipulations. You can immediately start summarising and visualising the data using DataFrame methods such as `.describe()`, `.head()`, `.plot()`, and so on.

For more information on `pandas.DataFrame`, see the [API documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
import pandas as pd

filename = 'pima-indians-diabetes.data.csv'

header = [
    'Pregnancy_Count',
    'Glucose_conc',
    'Blood_pressure',
    'Skin_thickness',
    'Insulin',
    'BMI',
    'DPF',
    'Age',
    'Class'
]

data = pd.read_csv(filename, names=header)
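
The difference between supplying `names=` and relying on a header row can be seen on a tiny in-memory example. This is a sketch under stated assumptions: the two data rows and the shortened column list are illustrative only, and `io.StringIO` replaces a real file.

```python
import io

import pandas as pd

columns = ["Pregnancy_Count", "Glucose", "Class"]  # shortened, illustrative list

no_header = "6,148,1\n1,85,0\n"
with_header = "Pregnancy_Count,Glucose,Class\n6,148,1\n1,85,0\n"

# File without a header row: supply the column names ourselves.
a = pd.read_csv(io.StringIO(no_header), names=columns)

# File with a header row: let pandas take the names from the first row.
b = pd.read_csv(io.StringIO(with_header), header=0)

# Common mistake: passing names= for a file that already has a header row.
# The header line is then read as an ordinary data row.
c = pd.read_csv(io.StringIO(with_header), names=columns)

print(a.equals(b))  # both loads give the same DataFrame
print(c.shape)      # one extra row: the header text became data
```

In the mistaken case the numeric columns are also read as text, because the first "data" row contains the column labels.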

### Load CSV using pandas from URL

In many cases, you may want to load CSV data directly from a web resource. Below, we show you how to modify the example to read the Pima Indians data from a GitHub URL, without having to download it locally first.

In [None]:
url = 'https://raw.githubusercontent.com/martyn-harris-bbk/AppliedMachineLearning/main/data/pima-indians-diabetes.data.csv'
header = ['Pregnancy_Count','Glucose_conc','Blood_pressure','Skin_thickness','Insulin','BMI','DPF','Age','Class']

data = pd.read_csv(url, names=header)

## Viewing our data

After successfully loading a dataset into a `DataFrame`, a critical next step is to inspect the structure of the data. We often want to do a quick sanity check to make sure the columns are parsed correctly and the data types match our expectations.

One quick way to do this is to call the `data.head()` method, which shows the first 5 rows of the DataFrame by default. This lets you see whether each column is in the correct position and whether the data values look reasonable.

In [None]:
data.head()

By default, `head()` returns 5 rows, but you can specify exactly how many rows you want to preview by passing an integer. For example, `data.head(20)` will show the first 20 rows.

Previewing just the first few rows is extremely helpful for verifying that the data has been read in correctly, especially if you suspect issues with delimiters, headers, or quoting.

In [None]:
data.head(20)
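
To see why a quick preview catches delimiter problems, consider the sketch below. It reads a made-up, semicolon-separated CSV twice: once with the default comma delimiter and once with the correct one. The data and the variable names `wrong` and `right` are hypothetical.

```python
import io

import pandas as pd

raw = "a;b;c\n1;2;3\n4;5;6\n"

# Default delimiter (comma): each line is read as a single field.
wrong = pd.read_csv(io.StringIO(raw))

# Correct delimiter: three separate columns.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.head())
print(wrong.shape)  # one column only
print(right.head())
print(right.shape)  # three columns
```

In the first preview, everything is squeezed into a single column named `a;b;c`, which is an immediate sign that the delimiter needs to be specified.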

Similarly, `data.tail()` returns the last few rows of the DataFrame. This can help you inspect how data is structured near the end of the file and confirm if the file terminates properly.

In [None]:
data.tail()

## What is the dimensionality of our data?

We generally want to know the overall shape of the data (i.e., how many rows and columns). The `DataFrame.shape` attribute provides this as a tuple `(rows, columns)`.

In machine learning contexts, the number of rows typically corresponds to the number of examples, and the number of columns represents your features (plus, optionally, a target column if included in the data). Knowing these figures is essential for planning data splitting, memory usage, and subsequent model training.

In [None]:
print(data.shape)

From the tuple returned, we can see that the Pima Indians dataset contains **768 rows** and **9 columns**. The 9 columns correspond to 8 explanatory features plus 1 target column (`Class`).

This knowledge helps us ensure we have the complete dataset loaded. It's also a good initial check before we proceed to more in-depth data profiling or feature analysis.
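
As a small illustration of how the shape relates to features and target, the sketch below unpacks `shape` and separates the target column from the features. It uses a three-row, four-column sample with shortened column names rather than the full dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Illustrative sample: three feature columns plus the target column `Class`.
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183],
        "BMI": [33.6, 26.6, 23.3],
        "Age": [50, 31, 32],
        "Class": [1, 0, 1],
    }
)

n_rows, n_cols = sample.shape   # (rows, columns)

X = sample.drop(columns="Class")  # features: every column except the target
y = sample["Class"]               # target: the outcome column

print(n_rows, n_cols)
print(X.shape, y.shape)
```

The same pattern applied to the full dataset would give 8 feature columns and 1 target column for each of the 768 rows.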

## Exploring more details of the data

Beyond checking the first few and last few rows, pandas offers convenient methods to quickly summarise your dataset. For instance, if you want to:

- View column data types and non-null counts, you can use `data.info()`. This can be useful to spot missing values or confirm that columns are numeric.
- Get a statistical summary of your numeric columns, you can use `data.describe()`, which provides statistics such as mean, standard deviation, minimum, and maximum values.

Let us look at some of these methods in action.

In [None]:
# Checking data info
data.info()

From `data.info()`, you can see how many rows have non-null values in each column and the data type (`int64`, `float64`, `object`, etc.). This can highlight if some columns are missing data or are stored in unexpected types.
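
The two problems mentioned above, missing data and unexpected types, can be reproduced on a small made-up DataFrame. This is a hedged sketch: the values are invented, and `pd.to_numeric(..., errors="coerce")` is shown as one possible way to recover a numeric column, not the only one.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "Glucose": [148.0, np.nan, 183.0],  # one missing value
        "BMI": ["33.6", "26.6", "n/a"],     # numbers stored as text
        "Class": [1, 0, 1],
    }
)

sample.info()  # non-null counts and data types per column

# Count missing values explicitly.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Convert the text column to numbers; unparseable entries become NaN.
bmi_numeric = pd.to_numeric(sample["BMI"], errors="coerce")
print(bmi_numeric)
```

Here `info()` reports only 2 non-null values for `Glucose` and a non-numeric type for `BMI`, which are exactly the signals to look for.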


### Getting a statistical summary of numerical columns

Using `data.describe()`, you can quickly observe general statistics about each numeric column, such as:

- **count**: The number of non-missing values.
- **mean**: Average value.
- **std**: Standard deviation, a measure of spread.
- **min** and **max**: The lowest and highest value observed.
- **25%**, **50%**, **75%**: The quartiles, which help you understand the distribution.

In [None]:
data.describe()

These statistical measures are crucial for initial exploratory data analysis, which guides subsequent data cleaning, feature engineering, and model building. We will explore some of these statistical measures in more depth as we progress.
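
To make the rows of the `describe()` output concrete, the sketch below compares them with the corresponding individual pandas methods on a single made-up column. The five ages are illustrative values, not taken from the dataset.

```python
import pandas as pd

sample = pd.DataFrame({"Age": [21, 30, 40, 50, 64]})

summary = sample.describe()
print(summary)

# Each row of the summary matches a dedicated method on the column.
age = sample["Age"]
print(age.count(), summary.loc["count", "Age"])
print(age.mean(), summary.loc["mean", "Age"])
print(age.std(), summary.loc["std", "Age"])      # sample standard deviation
print(age.median(), summary.loc["50%", "Age"])   # the 50% quartile is the median
```

The `50%` row is simply the median, and `std` is the sample standard deviation (pandas divides by `n - 1` by default).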

---

## Recommended datasets

Here are some widely used, freely available numeric datasets you might explore for practice:

### Numeric datasets

**Iris**  
- *Description*: Classic dataset of 150 iris flowers with four features each (sepal length, sepal width, petal length, petal width) and three species of iris as the target.  
- *Why it’s popular*: Very small (perfect for quick demos) and well-labelled, making it easy to visualise in 2D or 3D.  
- *Where to get it*: Built into scikit-learn (`sklearn.datasets.load_iris`) or from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Iris).

**Wine**  
- *Description*: Chemical analysis of wines grown in the same region in Italy but from three different cultivars. Each sample has 13 numeric features.  
- *Why it’s popular*: Good example for classification with multiple classes.  
- *Where to get it*: Built into scikit-learn (`sklearn.datasets.load_wine`) or from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wine).

**Wine Quality**  
- *Description*: Wine samples (red or white) with attributes such as acidity, sulphates, pH, and a quality rating.  
- *Why it’s popular*: Demonstrates regression or classification.  
- *Where to get it*: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

**Adult**  
- *Description*: Census data (48,842 instances) used for predicting whether an individual’s income exceeds $50K/year.  
- *Why it’s popular*: Classification with categorical and numeric features, plus data-cleaning challenges.  
- *Where to get it*: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult).

**Titanic**  
- *Description*: Passenger data about who survived/perished on the RMS Titanic.  
- *Why it’s popular*: Classic Kaggle competition for beginners, with mixed feature types and missing data.  
- *Where to get it*: [Kaggle Titanic Competition](https://www.kaggle.com/c/titanic).