# How To Load Machine Learning Data

1. Load CSV Files with the Python Standard Library.
2. Load CSV Files with NumPy.
3. Load CSV Files with Pandas.

## First example using csv

- The Python standard library provides the **csv** module, whose **reader()** function can be used to load CSV files.
- You can download the breast cancer dataset from: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

In [41]:
# Load CSV using the Python standard library
import csv
import numpy as np

filename = '../data/05/data.csv'

with open(filename, 'r') as f:
    # reader() expects an open file (or any iterable of lines), not a filename
    print(type(csv.reader(f, delimiter=',')))
    data_cancer = list(csv.reader(f, delimiter=','))
#data_cancer

<class '_csv.reader'>


Some explanations:

```python
filename = '../data/05/data.csv' ①
with open(filename, 'r') as f:
    data_cancer = list(csv.reader(f, delimiter=',')) ②
```
① Path to the CSV file

② Call the function **reader()** from the **csv** library, which here takes 2 arguments:
    - the open file object `f` (not the filename)
    - the delimiter; here the delimiter is the comma
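
To see what **reader()** yields without the dataset at hand, here is a minimal sketch that feeds it an in-memory file. The `sample` variable is hypothetical and holds only three columns and two rows copied from the dataset.

```python
import csv
import io

# Hypothetical three-line sample standing in for the real file
sample = io.StringIO("id,diagnosis,radius_mean\n842302,M,17.99\n92751,B,7.76\n")

rows = list(csv.reader(sample, delimiter=','))

print(rows[0])   # header row
print(rows[1])   # first data row; every field is still a string
```

Each row comes back as a list of strings, so numeric columns have to be converted explicitly before doing arithmetic on them.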

In [30]:
raw_data = open(filename, 'r')
reader = csv.reader(raw_data, delimiter=',')

next(reader)
x = list(reader)
raw_data.close()  # release the file handle once all rows are read
data_cancer = np.array(x).astype('str')

Some explanations:

```python
raw_data = open(filename, 'r') ①
next(reader) ②
x = list(reader) ③
data_cancer = np.array(x).astype('str') ④
```
① Open the file in **read** mode

② Skip the header line

③ Read the remaining rows from the reader into a list

④ Convert the list to a NumPy array of strings
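
The four steps above can be tried on a small in-memory sample. This is only a sketch: `sample` and `header` are hypothetical names, and the rows are a three-column excerpt rather than the full file.

```python
import csv
import io

import numpy as np

# Hypothetical excerpt: one header line and two data lines
sample = io.StringIO("id,diagnosis,radius_mean\n842302,M,17.99\n92751,B,7.76\n")

reader = csv.reader(sample, delimiter=',')
header = next(reader)            # consume the header row and keep it for later
rows = np.array(list(reader))    # remaining rows as a 2-D array of strings

print(header)
print(rows.shape)
```

Keeping the value returned by `next(reader)` is optional, but it gives you the column names without reading the file a second time.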

In [384]:
fractal_dimension_worst = [float(item[-1]) for item in data_cancer[1:]]

Some explanations:

```python
fractal_dimension_worst = [float(item[-1]) for item in data_cancer[1:]] 
```
This is equivalent to:
```python
fractal_dimension_worst = []
for item in data_cancer[1:]:
    fractal_dimension_worst.append(float(item[-1]))
```
- data_cancer[1:] : start from the second row, which skips the header when it is still the first element of data_cancer
- item[-1] : read the last column
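
As a quick check that the two forms agree, here is a small sketch on made-up input. The `rows` list is hypothetical: a header followed by two values taken from the last column of the dataset.

```python
# Hypothetical rows: a header followed by two data rows
rows = [
    ['id', 'fractal_dimension_worst'],
    ['842302', '0.1189'],
    ['842517', '0.08902'],
]

# List comprehension: skip the header, convert the last field to float
by_comprehension = [float(item[-1]) for item in rows[1:]]

# Explicit loop doing the same work
by_loop = []
for item in rows[1:]:
    by_loop.append(float(item[-1]))

print(by_comprehension == by_loop)
```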

## Compute the mean of the <span style="color:green"> "fractal_dimension_worst" </span> column
### Basic 

In [385]:
print("The mean is :", sum(fractal_dimension_worst) / len(fractal_dimension_worst))

The mean is : 0.08394581722319855


### Using the function mean() of numpy

In [386]:
print("The mean is :",np.mean(fractal_dimension_worst))

The mean is : 0.0839458172231986
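
The two results above differ only in their last digits, which is ordinary floating-point rounding: the order in which the values are added affects the final bits. A minimal sketch of the same effect with the standard library, assuming three values copied from the last column:

```python
import math

values = [0.1189, 0.08902, 0.08758]   # three values from the last column

basic = sum(values) / len(values)
# math.fsum tracks the rounding error while summing
precise = math.fsum(values) / len(values)

# Compare with a tolerance instead of ==
print(math.isclose(basic, precise, rel_tol=1e-12))
```

When comparing means computed in different ways, a tolerance-based comparison such as `math.isclose` is safer than exact equality.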


## Manipulating files using numpy

### Creating A NumPy Array

In [434]:
# import csv
import numpy as np

# np.float was removed in recent NumPy releases; the built-in float is equivalent
fractal_dimension_worst = np.array(fractal_dimension_worst, dtype=float)
#print(fractal_dimension_worst)

### Shape of the array

In [435]:
fractal_dimension_worst.shape

(569,)

### Using NumPy to Read In Files : function genfromtxt()

- Once the file is defined and open for reading, genfromtxt splits each non-empty line into a sequence of strings. 

- Empty or commented lines are just skipped. 

- The delimiter keyword is used to define how the splitting should take place.
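
The behaviour described above can be observed on a small in-memory sample. This is a sketch under assumptions: `sample` is a hypothetical three-column excerpt, and the commented line relies on the default comment character `#`.

```python
from io import StringIO

import numpy as np

# Hypothetical sample with a blank line and a commented line
sample = StringIO(
    "842302,17.99,0.1189\n"
    "\n"
    "# this line is ignored\n"
    "842517,20.57,0.08902\n"
)

arr = np.genfromtxt(sample, delimiter=",")
print(arr.shape)   # only the two data lines remain
print(arr.dtype)   # floats by default when no dtype is given
```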

In [453]:
data_cancer = np.genfromtxt(filename, delimiter=",", skip_header=1, dtype=str)
type(data_cancer)

numpy.ndarray

With the header skipped, the array starts at the first data row:
```Python
array([['842302', 'M', '17.99', ..., '0.2654', '0.4601', '0.1189'],
       ['842517', 'M', '20.57', ..., '0.186', '0.275', '0.08902'],
       ['84300903', 'M', '19.69', ..., '0.243', '0.3613', '0.08758'],
       ...,
       ['926954', 'M', '16.6', ..., '0.1418', '0.2218', '0.0782'],
       ['927241', 'M', '20.6', ..., '0.265', '0.4087', '0.124'],
       ['92751', 'B', '7.76', ..., '0', '0.2871', '0.07039']], dtype='<U9')
```

- The presence of a header in the file can hinder data processing. 

- In that case, we need to use the optional **skip_header** argument. 

    - The value of this argument must be an integer: the number of lines to skip at the beginning of the file, before any other action is performed. 

- Similarly, we can skip the last n lines of the file by using the **skip_footer** argument and giving it a value of n.
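
A minimal sketch of both arguments on an in-memory sample. The `text` variable is hypothetical: a header line followed by three rows excerpted from the dataset.

```python
from io import StringIO

import numpy as np

text = (
    "id,radius_mean\n"   # header line
    "842302,17.99\n"
    "842517,20.57\n"
    "92751,7.76\n"
)

# Skip only the header
no_header = np.genfromtxt(StringIO(text), delimiter=",", skip_header=1)

# Skip the header and the last line
trimmed = np.genfromtxt(StringIO(text), delimiter=",", skip_header=1, skip_footer=1)

print(no_header.shape)
print(trimmed.shape)
```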

In [391]:
data_cancer = np.genfromtxt(filename, delimiter=",", skip_header=3, dtype=str)

With skip_header=3, the header and the first two data rows are skipped, so the array starts at the third data row:
```Python
array([['84300903', 'M', '19.69', ..., '0.243', '0.3613', '0.08758'],
       ['84348301', 'M', '11.42', ..., '0.2575', '0.6638', '0.173'],
       ['84358402', 'M', '20.29', ..., '0.1625', '0.2364', '0.07678'],
       ...,
       ['926954', 'M', '16.6', ..., '0.1418', '0.2218', '0.0782'],
       ['927241', 'M', '20.6', ..., '0.265', '0.4087', '0.124'],
       ['92751', 'B', '7.76', ..., '0', '0.2871', '0.07039']], dtype='<U9')
```

In [442]:
data_cancer = np.genfromtxt(filename, delimiter=",", skip_footer=2, dtype=str)
type(data_cancer)

numpy.ndarray

### First element of the data_cancer array

In [441]:
data_cancer[0]

array(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean',
       'concavity_mean', 'concave points_mean', 'symmetry_mean',
       'fractal_dimension_mean', 'radius_se', 'texture_se',
       'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se',
       'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'], dtype='<U23')

In [393]:
data_cancer[0][0]

'84300903'

## Load CSV using Pandas

- You can load your CSV data using Pandas and the pandas.read_csv() function. 

- This function is very flexible and is the approach I would recommend for loading your machine learning data. 

- The function returns a pandas.DataFrame (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
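
Before running it on the full file, here is a small sketch of what read_csv() returns. `sample` and `df_sample` are hypothetical names, and the data is a three-column excerpt of the dataset held in memory.

```python
import io

import pandas as pd

# Hypothetical excerpt: one header line and two data lines
sample = io.StringIO("id,diagnosis,radius_mean\n842302,M,17.99\n92751,B,7.76\n")

df_sample = pd.read_csv(sample)   # the first line becomes the column names

print(df_sample.shape)
print(df_sample.dtypes)           # column types are inferred from the values
```

Unlike the **csv** module, read_csv() infers a type for each column, so numeric columns are ready for arithmetic right away.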

In [33]:
import pandas as pd

filename = '../data/05/data.csv'
data_cancer = pd.read_csv(filename)
#print(data_cancer)

### Data access

In [34]:
data_cancer["id"]

0        842302
1        842517
2      84300903
3      84348301
4      84358402
         ...   
564      926424
565      926682
566      926954
567      927241
568       92751
Name: id, Length: 569, dtype: int64

In [35]:
data_cancer["perimeter_mean"]

0      122.80
1      132.90
2      130.00
3       77.58
4      135.10
        ...  
564    142.00
565    131.20
566    108.30
567    140.10
568     47.92
Name: perimeter_mean, Length: 569, dtype: float64

### Drop a column from the DataFrame

In [36]:
df = data_cancer.drop(columns="perimeter_mean")
df

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,M,20.60,29.33,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400
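
Note that drop() returns a new DataFrame and leaves the original untouched. A minimal sketch on a hypothetical three-column frame (`frame` and `smaller` are illustrative names):

```python
import pandas as pd

# Hypothetical excerpt of the dataset
frame = pd.DataFrame({
    "id": [842302, 842517],
    "diagnosis": ["M", "M"],
    "perimeter_mean": [122.80, 132.90],
})

smaller = frame.drop(columns="perimeter_mean")

print(list(smaller.columns))   # column removed in the copy
print(list(frame.columns))     # original still has all three columns
```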


### Access a group of rows and columns by label(s) or a boolean array.


In [37]:
selec = df.loc[ df['diagnosis'] == "M"]
selec

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,...,25.38,17.33,184.60,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.11890
1,842517,M,20.57,17.77,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,...,24.99,23.41,158.80,1956.0,0.1238,0.1866,0.2416,0.1860,0.2750,0.08902
2,84300903,M,19.69,21.25,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,...,23.57,25.53,152.50,1709.0,0.1444,0.4245,0.4504,0.2430,0.3613,0.08758
3,84348301,M,11.42,20.38,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,...,14.91,26.50,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.17300
4,84358402,M,20.29,14.34,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,...,22.54,16.67,152.20,1575.0,0.1374,0.2050,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
563,926125,M,20.92,25.09,1347.0,0.10990,0.22360,0.31740,0.14740,0.2149,...,24.29,29.41,179.10,1819.0,0.1407,0.4186,0.6599,0.2542,0.2929,0.09873
564,926424,M,21.56,22.39,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,...,25.45,26.40,166.10,2027.0,0.1410,0.2113,0.4107,0.2216,0.2060,0.07115
565,926682,M,20.13,28.25,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,...,23.69,38.25,155.00,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637
566,926954,M,16.60,28.08,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,...,18.98,34.12,126.70,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.07820
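
The selection above works in two steps: the comparison builds a boolean Series, and loc keeps the rows where it is True. A minimal sketch on a hypothetical three-row frame, which also shows that loc can restrict the columns at the same time:

```python
import pandas as pd

# Hypothetical excerpt of the dataset
frame = pd.DataFrame({
    "id": [842302, 842517, 92751],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 7.76],
})

mask = frame["diagnosis"] == "M"                 # boolean Series, one value per row
selected = frame.loc[mask]                       # rows where the mask is True
subset = frame.loc[mask, ["id", "radius_mean"]]  # same rows, two columns only

print(mask.tolist())
print(subset)
```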


## Load CSV using Pandas from URL

In [40]:
import pandas as pd
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"

data = pd.read_csv(url)
data

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA

