# <font color=Green> Lecture 4: Downloading and Reading the dataset </font>

## Learning Objectives:
### Students will be able to :
* Download datasets using OS commands
* Download datasets using the Pandas `read_csv` function
* Read datasets into a Pandas dataframe using the `read_csv` function
* Fix file format problems with the `read_csv` function
* Replace missing-value symbols in the Pandas dataframe with null (NaN) values
* Save a dataset to a CSV file


-------------------------

**First we need to import the Pandas package**

In [1]:
import pandas as pd
import numpy as np

## What is Pandas Dataframe?
Dataframes are used to load datasets from files in order to process them. Here we will use dataframes to read the data, clean the data, process the data, and query and select subsets from the data to plot it.

The Pandas dataframe is the primary container of data.
It can be indexed by rows and columns. Each row or column can be indexed by its **label** or by its **integer position** (0 for the first row/column, n-1 for the last row/column).
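The two ways of indexing can be sketched on a tiny table. The row labels and values below are made up for illustration; `.loc` selects by label and `.iloc` selects by integer position.

```python
import pandas as pd

# A tiny hypothetical table; labels and values are for illustration only.
df = pd.DataFrame(
    {"LBI": [1.23, 1.24, 1.07], "RTI": [27.0, 26.5, 29.1]},
    index=["arn", "be", "bi1"],
)

by_label = df.loc["be", "RTI"]   # select by row label and column label
by_position = df.iloc[1, 1]      # select by integer position (second row, second column)
last_row = df.iloc[-1]           # negative positions count from the end

print(by_label, by_position)
print(last_row)
```

Both lookups return the same cell here, because the row labelled `be` is also the row at position 1.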



<img src="DataFrameComponents.png" width="650" >

## Why Pandas Dataframe?
 * Handles large datasets efficiently
 * Provides many ready-made, easy-to-use data manipulation functions
 * Lets you write a script once and reuse it with different datasets
 

## <font color=blue> 1- Download the dataset file using OS  command (e.g. ``curl``, ``wget``)</font>

* Sometimes you want to download the dataset file to a local directory. In a notebook, you can run OS commands by preceding the command with the magic symbol `!`. Since both Windows and Linux-based operating systems have commands for downloading files, we can use them here.

Note: You can get more information about the StoneFlakes dataset (such as the column names and the meaning of the ID values) from the dataset's web site at https://archive.ics.uci.edu/ml/datasets/StoneFlakes#
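If neither `curl` nor `wget` is available, the same download can be done from Python itself. The following is a minimal sketch using only the standard library; `download` is a hypothetical helper name, not something defined elsewhere in this lecture.

```python
import shutil
from urllib.request import urlopen

def download(url, dest):
    """Copy the resource at `url` into the local file `dest` and return `dest`."""
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)   # stream the bytes to disk
    return dest

# Example call (needs network access, so it is left commented out here):
# download(
#     "http://archive.ics.uci.edu/ml/machine-learning-databases/00299/StoneFlakes.dat",
#     "StoneFlakes.dat",
# )
```

The helper works the same way on Windows, Linux, and macOS, which is convenient when a notebook is shared between machines.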


In [2]:
!curl http://archive.ics.uci.edu/ml/machine-learning-databases/00299/StoneFlakes.dat -o StoneFlakes.dat
# Linux !wget http://archive.ics.uci.edu/ml/machine-learning-databases/00299/StoneFlakes.dat

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  3523  100  3523    0     0  28184      0 --:--:-- --:--:-- --:--:-- 28184


Let's confirm that the file was downloaded, and then look at its first few lines
using the Linux/macOS command `head` or the Windows command `more`.

In [3]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is B2B3-5C73

 Directory of C:\SP2021\Visualization\Lectures\W04

01/29/2021  11:50 AM    <DIR>          .
01/29/2021  11:50 AM    <DIR>          ..
01/29/2021  11:44 AM    <DIR>          .ipynb_checkpoints
01/28/2021  10:14 PM            31,709 DataFrameComponents.png
01/28/2021  09:09 PM           447,184 L04-part1.pptx
01/29/2021  11:48 AM             3,523 StoneFlakes.dat
01/29/2021  11:50 AM            71,406 W04-Pandas.ipynb
01/28/2021  09:07 PM           383,978 W04-part1.pdf
               5 File(s)        937,800 bytes
               3 Dir(s)  28,971,831,296 bytes free


In [4]:
#Linux, mac
#!head StoneFlakes.dat
#Windows
!more StoneFlakes.dat 

ID    LBI   RTI  WDI FLA  PSF  FSF ZDF1 PROZD
ar      ?,35.30,2.60,  ?,42.4,24.2,47.1,69
arn  1.23,27.00,3.59,122, 0.0,40.0,40.0,30
be   1.24,26.50,2.90,121,16.0,20.7,29.7,72
bi1  1.07,29.10,3.10,114,44.0, 2.6,26.3,68
bi2  1.08,43.70,2.40,105,32.6, 5.8,10.7,42
bie  1.39,29.50,2.78,126,14.0, 0.0,50.0,78
bn   1.31,26.30,2.10,119,15.7,15.7,30.4,72
bo   1.27,27.60,3.50,116,16.8,23.0,35.2,69
by   1.11,32.60,2.90,113,15.8,15.8,15.0,57
c    1.32,29.50,2.57,121,22.0, 2.0,18.0,63
cl   1.16,33.40,2.30,131, 7.5,14.9, 6.0,60
d    1.23,27.60,2.83,121,27.9, 6.7,31.7,67
e1   1.24,25.50,3.60,113, 5.3,21.1,60.0,86
e2   1.20,27.70,3.40,108,37.4, 9.9,39.9,71
ey   1.33,29.40,2.30,120,13.5, 6.9,38.3,72
fli  1.11,35.10,3.51,123,67.2, 2.5, 5.9,46
g10  1.14,29.90,2.90,116, 5.3,55.3,63.2,84
g11  1.16,27.90,3.00,116, 3.5,23.7,64.8,85
g2   1.22,29.50,3.00,116, 6.0,39.7,50.2,78
g4   1.23,29.00,2.90,116, 8.0,24.0,60.0,87
g5   1.25,30.20,2.80,118, 7.5,37.7,43.9,78
g6   1.19,31.60,2.40,115, 0.0,33.3,58.7,82
ga1  1.1

## <font color=blue> 2- Reading data from a CSV file </font>

* You can read data from a CSV file using the **`read_csv`** function. By default, it assumes that the fields are comma-separated.
* It is important to know how the data is stored in the file. Here are some important things to look for:
    * Does the data have a header or not?
    * How many columns are in the data, and what is the data type of each column?
    * How are the columns formatted, and is it fixed-width column formatting (each column has a predefined width)?
    * What separator (delimiter) is used to separate the values in each row (e.g. a single space, multiple spaces, a tab, a comma, ...)?
    * When there are missing values in some entries, what symbol is used to represent them (NA, N/A, None, space, ?, --, ... etc.)?
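A quick way to answer these questions is to print the first few raw lines of the file before parsing it. The sketch below is a portable alternative to `head` and `more`; `peek` is a hypothetical helper name.

```python
from itertools import islice

def peek(path, n=5):
    """Return the first `n` lines of a text file without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

# Usage, assuming StoneFlakes.dat is in the working directory:
# for line in peek("StoneFlakes.dat"):
#     print(repr(line))   # repr() makes spaces and tabs visible
```

Printing with `repr()` helps tell tabs from runs of spaces, which matters when choosing a separator.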


* For our first data processing task with pandas, we will use the **Stone Flakes** data set. This data set is located at the **UCI repository**.

* The UCI Machine Learning Repository is a "go-to" place for fascinating data sets often used in machine learning education and research.

Let's check out the **Stone Flakes** data set.

* To download the StoneFlakes.dat file, we can use any of the following:
    * manually go to the UCI repository and download it
    * use OS commands from within the notebook (for example, `!wget` on Linux, or `!curl` on Windows and macOS)
    * use pandas **`read_csv`** with a **URL** to the dataset


### <font color=green> 2.1- Read the dataset from a local file</font>

In [5]:
notclean = pd.read_csv('StoneFlakes.dat')

In [6]:
notclean.shape

(79, 1)

In [8]:
notclean[:9]

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,ID LBI RTI WDI FLA PSF FSF ZDF1 PROZD
ar ?,35.3,2.6,?,42.4,24.2,47.1,69
arn 1.23,27.0,3.59,122,0.0,40.0,40.0,30
be 1.24,26.5,2.9,121,16.0,20.7,29.7,72
bi1 1.07,29.1,3.1,114,44.0,2.6,26.3,68
bi2 1.08,43.7,2.4,105,32.6,5.8,10.7,42
bie 1.39,29.5,2.78,126,14.0,0.0,50.0,78
bn 1.31,26.3,2.1,119,15.7,15.7,30.4,72
bo 1.27,27.6,3.5,116,16.8,23.0,35.2,69
by 1.11,32.6,2.9,113,15.8,15.8,15.0,57


### 2.2 <font color=green> Downloading the dataset using ``read_csv`` and URL </font>

Pandas `read_csv` can read a remote dataset directly into a dataframe using the dataset's URL. That is, you don't have to download the dataset to a local file first.

Example: In the following cell, we pass the URL of the StoneFlakes dataset directly to `read_csv`, which reads the dataset remotely and returns a Pandas dataframe.

* Note that the dataset must exist and be reachable at the location given by the URL.

In [9]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00299/StoneFlakes.dat"
notclean = pd.read_csv(url)

In [10]:
notclean.shape

(79, 1)

In [11]:
notclean[:5]

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,ID LBI RTI WDI FLA PSF FSF ZDF1 PROZD
ar ?,35.3,2.6,?,42.4,24.2,47.1,69
arn 1.23,27.0,3.59,122,0.0,40.0,40.0,30
be 1.24,26.5,2.9,121,16.0,20.7,29.7,72
bi1 1.07,29.1,3.1,114,44.0,2.6,26.3,68
bi2 1.08,43.7,2.4,105,32.6,5.8,10.7,42


$\color{#97365A}{Note~}$ that the column format of the dataframe above is not correct. We will see later why this happens and how to fix it.

## <font color=green> 3- Dealing with formatting problems </font>
Datasets are usually stored in files in a specific format, and not all files use the same format. With our `StoneFlakes` dataset, we saw that `read_csv` couldn't parse the data correctly without some extra information. So we need to look at the `StoneFlakes.dat` file and see what format is used to store this dataset.

Looking closely at the `StoneFlakes.dat` file, we see that the column values are separated by commas, except for the first column, which is separated from the second by spaces. The header row is separated by spaces only. We also notice that the question mark symbol `?` is used wherever a data value is missing.

How can we read such a file correctly?

The [pandas package](http://pandas.pydata.org/) offers powerful options for parsing data. Let's try the optional parameters of the `read_csv` function.

### <font color=#2e86c1> 3.1 - 1st Try: using `read_csv` default parameters (without any extra parameters) </font>

In [12]:
notclean = pd.read_csv('StoneFlakes.dat')

### Is it working?
Let's check the columns (how many columns? column names? column types?)

The Pandas dataframe has a `columns` property, which we can check (display) and set (change).

In [13]:
notclean.columns

Index(['ID    LBI   RTI  WDI FLA  PSF  FSF ZDF1 PROZD'], dtype='object')

In [14]:
len(notclean.columns)

1

In [15]:
notclean.shape

(79, 1)

* To see some data from the beginning and the end:

In [16]:
notclean.head(n=7)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,ID LBI RTI WDI FLA PSF FSF ZDF1 PROZD
ar ?,35.3,2.6,?,42.4,24.2,47.1,69
arn 1.23,27.0,3.59,122,0.0,40.0,40.0,30
be 1.24,26.5,2.9,121,16.0,20.7,29.7,72
bi1 1.07,29.1,3.1,114,44.0,2.6,26.3,68
bi2 1.08,43.7,2.4,105,32.6,5.8,10.7,42
bie 1.39,29.5,2.78,126,14.0,0.0,50.0,78
bn 1.31,26.3,2.1,119,15.7,15.7,30.4,72


In [17]:
notclean.tail(n=5)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,ID LBI RTI WDI FLA PSF FSF ZDF1 PROZD
wn 1.04,30.4,2.7,118,33.3,5.1,28.2,66
woe 1.14,31.9,2.07,121,15.0,10.0,5.0,47
wol 1.24,27.2,3.0,120,29.1,13.8,34.4,70
wst 1.15,38.0,2.25,125,9.5,6.4,13.0,88
z 1.37,26.0,3.22,125,12.6,21.5,28.6,61


### This doesn't look good
As we see from the `.shape` attribute, the file was read as only one column.
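The one-column result can be reproduced without the file by parsing a few lines from memory. This is a small sketch using three lines copied from the top of `StoneFlakes.dat`; the header contains no commas, so `read_csv` sees a single column name, and the extra comma-separated fields in each row appear to end up in the row index rather than in columns.

```python
import io
import pandas as pd

# Three lines copied from the top of StoneFlakes.dat.
sample = (
    "ID    LBI   RTI  WDI FLA  PSF  FSF ZDF1 PROZD\n"
    "ar      ?,35.30,2.60,  ?,42.4,24.2,47.1,69\n"
    "arn  1.23,27.00,3.59,122, 0.0,40.0,40.0,30\n"
)

df = pd.read_csv(io.StringIO(sample))   # default sep=','
print(df.shape)
print(list(df.columns))
```

Wrapping text in `io.StringIO` is a convenient way to experiment with `read_csv` options on a small sample before running them on the full file.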

### <font color=#2e86c1> 3.2 - 2nd Try: Using the `sep` parameter to tell `read_csv` what to treat as the column delimiter </font>


In [18]:
notclean = pd.read_csv('StoneFlakes.dat',sep=',') 

Let's look at the dataframe's shape:

In [19]:
notclean.shape

(79, 1)

##### We still have the same issue!

In fact, `sep=','` is redundant here because the default value of `sep` is already `','`.

### <font color=#2e86c1> 3.3 - Here is another try: Use multiple separators </font>

Since our dataset uses more than one separator (comma and space), we need to tell `read_csv` to use both.

One way to do that is to use a `regex` (regular expression) for the separator. Since some columns are separated by spaces and some by commas, we will use a regex that matches any run of commas and whitespace:

<font color=Red> `sep='[,\s]+'` </font> means one or more characters, each of which is a `comma` or a `whitespace` character.


* So, for `read_csv()` to parse the dataset properly, we need to give it a regular expression that represents all possible separators in this dataset.

The regular expression (`sep = "[,\s]+"`) is interpreted as follows:

`"\s"` matches a single whitespace character (a space, a tab, and so on).

`"[ ]"` means "a set of characters", so `"[,\s]"` matches one character that is either a comma or whitespace.

`"[ ]+"` means one or more characters from the set, so a comma followed by several spaces counts as a single separator.

You may also see this written as `"[,\s+]+"`. It gives the same result on this file, but inside square brackets `+` is a literal plus sign, not "one or more", so the extra `+` only adds the plus sign to the set of separators.

The regular expression (`sep = "[,\s]*"`) would be interpreted as zero or more commas or whitespace characters, i.e. the separator is allowed to be empty, so it will not work with our `StoneFlakes.dat` dataset.

   * Note the difference between `"*"` and `"+"`: `"*"` means "zero or more", so whatever precedes the `"*"` is optional, while `"+"` requires at least one of what precedes it.
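The separator pattern can be tried on a single line with Python's `re` module before handing it to `read_csv`. The sketch below splits the first data row of `StoneFlakes.dat`; the short `"a+b, c"` string is a made-up example that shows how a `+` inside square brackets behaves.

```python
import re

line = "ar      ?,35.30,2.60,  ?,42.4,24.2,47.1,69"   # first data row of StoneFlakes.dat

fields = re.split(r"[,\s]+", line)
print(fields)
print(len(fields))

# Inside [...] a '+' is a literal plus sign, not a quantifier:
keeps_plus = re.split(r"[,\s]+", "a+b, c")     # '+' stays part of the data
splits_plus = re.split(r"[,\s+]+", "a+b, c")   # '+' is treated as a separator
print(keeps_plus, splits_plus)
```

The row splits into nine fields, which matches the nine names in the header line.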

So, let's try it with `sep = "[,\s]+"`:

In [2]:
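# A raw string (r'...') keeps the backslash in \s intact; regex separators need the python engine.
notclean = pd.read_csv('StoneFlakes.dat', sep=r'[,\s]+', engine='python')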


In [23]:
notclean.shape

(79, 9)

We see that the number of columns is 9 now, which is correct. Let's verify that they are parsed correctly by displaying some rows 

In [24]:
notclean[:10]

Unnamed: 0,ID,LBI,RTI,WDI,FLA,PSF,FSF,ZDF1,PROZD
0,ar,?,35.3,2.6,?,42.4,24.2,47.1,69
1,arn,1.23,27.0,3.59,122,0.0,40.0,40.0,30
2,be,1.24,26.5,2.9,121,16.0,20.7,29.7,72
3,bi1,1.07,29.1,3.1,114,44.0,2.6,26.3,68
4,bi2,1.08,43.7,2.4,105,32.6,5.8,10.7,42
5,bie,1.39,29.5,2.78,126,14.0,0.0,50.0,78
6,bn,1.31,26.3,2.1,119,15.7,15.7,30.4,72
7,bo,1.27,27.6,3.5,116,16.8,23.0,35.2,69
8,by,1.11,32.6,2.9,113,15.8,15.8,15.0,57
9,c,1.32,29.5,2.57,121,22.0,2.0,18.0,63


In [25]:
notclean.columns

Index(['ID', 'LBI', 'RTI', 'WDI', 'FLA', 'PSF', 'FSF', 'ZDF1', 'PROZD'], dtype='object')

#### Another Issue:

__In the output above, some rows contain the `?` symbol, which marks a missing value.__

Python operators and functions do not recognize these symbols as missing values, so we need to replace them with something they can recognize as missing. NumPy provides `np.nan` (__not a number__) for this purpose. We will see later how to apply it to our dataset.


Here are some examples of how Python math operators work with NaN values:

In [26]:
x=np.nan
y=5
x+y

nan

In [27]:
sum([3, np.nan])

nan

In [29]:
x = np.array([1,33,4,np.nan,0,-1])
x>0

  


array([ True,  True,  True, False, False, False])

In [30]:
x+2

array([ 3., 35.,  6., nan,  2.,  1.])
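Because NaN propagates through arithmetic and never compares equal to anything, it helps to know how to detect it and how to skip it. The following sketch reuses the array from the cells above and relies only on standard NumPy and pandas functions.

```python
import numpy as np
import pandas as pd

x = np.array([1, 33, 4, np.nan, 0, -1])

print(np.nan == np.nan)   # NaN is not equal to anything, including itself
print(np.isnan(x))        # element-wise test for missing values
print(pd.isna(x))         # pandas equivalent; also works on non-numeric data
print(x.sum())            # NaN propagates through an ordinary sum
print(np.nansum(x))       # nansum skips the NaN entries
```

This is why a test such as `x == np.nan` cannot be used to find missing values; use `np.isnan` or `pd.isna` instead.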

### <font color=#2e86c1> 3.4 - Replacing missing-value symbols with `NaN` </font>


* We can create a list of possible missing-value symbols and pass it as the `na_values` parameter of ``read_csv``.

* Since missing values here are represented by the "?" symbol, we can set the `na_values` parameter to `['?']`.
* If more than one symbol is used for missing values, include them all in the list, e.g. `['?','NA','--','n/a']`.

### Let `read_csv` handle the missing values

To make `read_csv` replace the missing values with `NaN`:

In [32]:
mvs = ['?','NA','--','n/a']
notclean = pd.read_csv('StoneFlakes.dat', na_values=mvs, sep=r'[,\s]+', engine='python')

Let's see if that worked

In [33]:
notclean[:5]

Unnamed: 0,ID,LBI,RTI,WDI,FLA,PSF,FSF,ZDF1,PROZD
0,ar,,35.3,2.6,,42.4,24.2,47.1,69
1,arn,1.23,27.0,3.59,122.0,0.0,40.0,40.0,30
2,be,1.24,26.5,2.9,121.0,16.0,20.7,29.7,72
3,bi1,1.07,29.1,3.1,114.0,44.0,2.6,26.3,68
4,bi2,1.08,43.7,2.4,105.0,32.6,5.8,10.7,42


In [34]:
notclean.shape

(79, 9)
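Looking at the first few rows is one check; counting the missing values per column is another. The sketch below parses a four-line sample copied from `StoneFlakes.dat` from memory, so it runs without the file, and uses `isna().sum()` to count the `NaN` entries in each column.

```python
import io
import pandas as pd

# Header and first three data rows copied from StoneFlakes.dat.
sample = (
    "ID    LBI   RTI  WDI FLA  PSF  FSF ZDF1 PROZD\n"
    "ar      ?,35.30,2.60,  ?,42.4,24.2,47.1,69\n"
    "arn  1.23,27.00,3.59,122, 0.0,40.0,40.0,30\n"
    "be   1.24,26.50,2.90,121,16.0,20.7,29.7,72\n"
)

df = pd.read_csv(io.StringIO(sample), sep=r"[,\s]+", engine="python", na_values=["?"])

missing_per_column = df.isna().sum()   # number of NaN values in each column
print(missing_per_column)
print(df.dtypes)                       # columns with NaN are stored as floats
```

On the full dataset, the same call would be `notclean.isna().sum()`.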


   
#### We can further process the file while reading to drop records with errors

**on_bad_lines='skip'** : drop rows with errors (e.g. too many fields) instead of raising an error. This parameter replaces the older **error_bad_lines=False**, which was deprecated in pandas 1.3 and removed in pandas 2.0.

In [35]:
mvs = ['?','NA','--','n/a']
notclean = pd.read_csv('StoneFlakes.dat', na_values=mvs, sep=r'[,\s]+', engine='python', on_bad_lines='skip')
notclean[:5]

Unnamed: 0,ID,LBI,RTI,WDI,FLA,PSF,FSF,ZDF1,PROZD
0,ar,,35.3,2.6,,42.4,24.2,47.1,69
1,arn,1.23,27.0,3.59,122.0,0.0,40.0,40.0,30
2,be,1.24,26.5,2.9,121.0,16.0,20.7,29.7,72
3,bi1,1.07,29.1,3.1,114.0,44.0,2.6,26.3,68
4,bi2,1.08,43.7,2.4,105.0,32.6,5.8,10.7,42


In [36]:
notclean.shape

(79, 9)
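The StoneFlakes file has no malformed rows, so the shape does not change. To see the effect of skipping bad lines, here is a small sketch on made-up data in which one row has an extra field. It assumes pandas 1.3 or later, where the `on_bad_lines` parameter is available.

```python
import io
import pandas as pd

# Hypothetical data: the third line has one field too many.
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")   # assumes pandas >= 1.3
print(df)
```

Skipping rows silently can hide data problems, so it is worth comparing the number of rows read with the number of lines in the file.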

### More on Pandas read commands
#### For more information about other Pandas commands for reading files (fixed-width text files, JSON, Excel, HTML, ...), see the Pandas documentation at:
https://pandas.pydata.org/pandas-docs/stable/api.html#input-output

## <font color=green> 4- The dataframe components </font>

We can get the Pandas dataframe components using the following attributes: `index`, `columns`, and `values`.

<img src="DataFrameComponents.png" width="550" >

In [37]:
index = notclean.index
columns = notclean.columns
values = notclean.values

In [38]:
index

RangeIndex(start=0, stop=79, step=1)

In [39]:
columns

Index(['ID', 'LBI', 'RTI', 'WDI', 'FLA', 'PSF', 'FSF', 'ZDF1', 'PROZD'], dtype='object')

In [40]:
len(columns)

9

In [41]:
values

array([['ar', nan, 35.3, 2.6, nan, 42.4, 24.2, 47.1, 69],
       ['arn', 1.23, 27.0, 3.59, 122.0, 0.0, 40.0, 40.0, 30],
       ['be', 1.24, 26.5, 2.9, 121.0, 16.0, 20.7, 29.7, 72],
       ['bi1', 1.07, 29.1, 3.1, 114.0, 44.0, 2.6, 26.3, 68],
       ['bi2', 1.08, 43.7, 2.4, 105.0, 32.6, 5.8, 10.7, 42],
       ['bie', 1.39, 29.5, 2.78, 126.0, 14.0, 0.0, 50.0, 78],
       ['bn', 1.31, 26.3, 2.1, 119.0, 15.7, 15.7, 30.4, 72],
       ['bo', 1.27, 27.6, 3.5, 116.0, 16.8, 23.0, 35.2, 69],
       ['by', 1.11, 32.6, 2.9, 113.0, 15.8, 15.8, 15.0, 57],
       ['c', 1.32, 29.5, 2.57, 121.0, 22.0, 2.0, 18.0, 63],
       ['cl', 1.16, 33.4, 2.3, 131.0, 7.5, 14.9, 6.0, 60],
       ['d', 1.23, 27.6, 2.83, 121.0, 27.9, 6.7, 31.7, 67],
       ['e1', 1.24, 25.5, 3.6, 113.0, 5.3, 21.1, 60.0, 86],
       ['e2', 1.2, 27.7, 3.4, 108.0, 37.4, 9.9, 39.9, 71],
       ['ey', 1.33, 29.4, 2.3, 120.0, 13.5, 6.9, 38.3, 72],
       ['fli', 1.11, 35.1, 3.51, 123.0, 67.2, 2.5, 5.9, 46],
       ['g10', 1.14, 29.9, 2.9, 1
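How the three components relate can be sketched on a two-row table. The values below are taken from the first rows of the dataset; `to_numpy()` is used here as an equivalent way to get the same array that `values` returns.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["ar", "arn"], "RTI": [35.3, 27.0]})

print(df.index)        # row labels
print(df.columns)      # column labels
arr = df.to_numpy()    # the data as a 2-D NumPy array (same content as df.values)
print(arr.shape, arr.dtype)
```

The array has one row per index entry and one column per column label. Because the table mixes text and numbers, the array falls back to the generic `object` dtype.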

### Note:
#### If the dataset file does not contain column headers, pass ``header=None`` to ``read_csv``; otherwise the first row of data will be treated as the header.
#### To assign new column names to the dataframe, set its ``.columns`` property to a list (or other array-like) of names.
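The effect of `header=None` is easy to see on a small headerless sample. In the sketch below the two data rows are made up for illustration; the `names` parameter of `read_csv` is an alternative to assigning `.columns` after reading.

```python
import io
import pandas as pd

raw = "ar,35.3\narn,27.0\n"   # hypothetical headerless data

wrong = pd.read_csv(io.StringIO(raw))                # first data row is consumed as the header
right = pd.read_csv(io.StringIO(raw), header=None)   # columns are numbered 0, 1
named = pd.read_csv(io.StringIO(raw), header=None, names=["ID", "RTI"])

print(wrong.shape, right.shape)
print(list(right.columns))
print(list(named.columns))
```

Without `header=None`, one row of data is lost to the header, so the dataframe ends up one row shorter than the file.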

In [56]:
notclean.columns

Index(['ID', 'LBI', 'RTI', 'WDI', 'FLA', 'PSF', 'FSF', 'ZDF1', 'PROZD'], dtype='object')

In [42]:
notclean.columns = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]

In [43]:
notclean[:5]

Unnamed: 0,A,B,C,D,E,F,G,H,I
0,ar,,35.3,2.6,,42.4,24.2,47.1,69
1,arn,1.23,27.0,3.59,122.0,0.0,40.0,40.0,30
2,be,1.24,26.5,2.9,121.0,16.0,20.7,29.7,72
3,bi1,1.07,29.1,3.1,114.0,44.0,2.6,26.3,68
4,bi2,1.08,43.7,2.4,105.0,32.6,5.8,10.7,42


In [44]:
# put the column labels back
notclean.columns = columns

In [45]:
notclean[:5]

Unnamed: 0,ID,LBI,RTI,WDI,FLA,PSF,FSF,ZDF1,PROZD
0,ar,,35.3,2.6,,42.4,24.2,47.1,69
1,arn,1.23,27.0,3.59,122.0,0.0,40.0,40.0,30
2,be,1.24,26.5,2.9,121.0,16.0,20.7,29.7,72
3,bi1,1.07,29.1,3.1,114.0,44.0,2.6,26.3,68
4,bi2,1.08,43.7,2.4,105.0,32.6,5.8,10.7,42


## <font color=green > 5- Saving to a CSV </font>

Let's save our dataframe for later use!

In [50]:
notclean.to_csv('processedStoneFlake.csv', index=False)

In [51]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is B2B3-5C73

 Directory of C:\SP2021\Visualization\Lectures\W04

01/29/2021  12:22 PM    <DIR>          .
01/29/2021  12:22 PM    <DIR>          ..
01/29/2021  11:44 AM    <DIR>          .ipynb_checkpoints
01/28/2021  10:14 PM            31,709 DataFrameComponents.png
01/28/2021  09:09 PM           447,184 L04-part1.pptx
01/29/2021  12:24 PM             3,299 processedStoneFlake.csv
01/29/2021  11:48 AM             3,523 StoneFlakes.dat
01/29/2021  12:22 PM            73,185 W04-Pandas.ipynb
01/28/2021  09:07 PM           383,978 W04-part1.pdf
               6 File(s)        942,878 bytes
               3 Dir(s)  28,193,800,192 bytes free


In [52]:
!more processedStoneFlake.csv

ID,LBI,RTI,WDI,FLA,PSF,FSF,ZDF1,PROZD
ar,,35.3,2.6,,42.4,24.2,47.1,69
arn,1.23,27.0,3.59,122.0,0.0,40.0,40.0,30
be,1.24,26.5,2.9,121.0,16.0,20.7,29.7,72
bi1,1.07,29.1,3.1,114.0,44.0,2.6,26.3,68
bi2,1.08,43.7,2.4,105.0,32.6,5.8,10.7,42
bie,1.39,29.5,2.78,126.0,14.0,0.0,50.0,78
bn,1.31,26.3,2.1,119.0,15.7,15.7,30.4,72
bo,1.27,27.6,3.5,116.0,16.8,23.0,35.2,69
by,1.11,32.6,2.9,113.0,15.8,15.8,15.0,57
c,1.32,29.5,2.57,121.0,22.0,2.0,18.0,63
cl,1.16,33.4,2.3,131.0,7.5,14.9,6.0,60
d,1.23,27.6,2.83,121.0,27.9,6.7,31.7,67
e1,1.24,25.5,3.6,113.0,5.3,21.1,60.0,86
e2,1.2,27.7,3.4,108.0,37.4,9.9,39.9,71
ey,1.33,29.4,2.3,120.0,13.5,6.9,38.3,72
fli,1.11,35.1,3.51,123.0,67.2,2.5,5.9,46
g10,1.14,29.9,2.9,116.0,5.3,55.3,63.2,84
g11,1.16,27.9,3.0,116.0,3.5,23.7,64.8,85
g2,1.22,29.5,3.0,116.0,6.0,39.7,50.2,78
g4,1.23,29.0,2.9,116.0,8.0,24.0,60.0,87
g5,1.25,30.2,2.8,118.0,7.5,37.7,43.9,78
g6,1.19,31.6,2.4,115.0,0.0,33.3,58.7,82
ga1,1.11,21.8,3.6,112.0,2.5,22.7,92.3,98
ga2,1.22,34.2,2.6,,46.3,0.0,63.6,82
go

In [53]:
nn = pd.read_csv('processedStoneFlake.csv')
nn.shape

(79, 9)

In [54]:
nn[:5]

Unnamed: 0,ID,LBI,RTI,WDI,FLA,PSF,FSF,ZDF1,PROZD
0,ar,,35.3,2.6,,42.4,24.2,47.1,69
1,arn,1.23,27.0,3.59,122.0,0.0,40.0,40.0,30
2,be,1.24,26.5,2.9,121.0,16.0,20.7,29.7,72
3,bi1,1.07,29.1,3.1,114.0,44.0,2.6,26.3,68
4,bi2,1.08,43.7,2.4,105.0,32.6,5.8,10.7,42
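The `index=False` argument used when saving matters for the round trip. The sketch below writes a small made-up dataframe to in-memory buffers instead of files and compares the columns that come back with and without the index.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": ["ar", "arn"], "LBI": [np.nan, 1.23]})

with_index = io.StringIO()
df.to_csv(with_index)                   # row index is written as an unnamed first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)   # only the data columns are written

with_index.seek(0)
without_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)
again = pd.read_csv(without_index)
cols_without = list(again.columns)

print(cols_with)
print(cols_without)
print(again["LBI"].isna().tolist())     # NaN is saved as an empty field and read back as NaN
```

Saving without the index keeps the saved file's columns identical to the dataframe's columns, so reading it back gives the same shape.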


# <font color=red> Learning Activity </font>

* Search for a dataset with missing data (you can use the UCI repository above)
* Download the dataset using (1) an OS command and (2) `read_csv`
* What symbol(s) are used for missing values?
* Read the dataset into a Pandas dataframe
* Display the header of the dataframe (the column names).
    * What if the dataset doesn't have headers? (see the ``read_csv`` documentation)
* Replace the missing values with NaN in the dataframe
* Save the new dataset to a new CSV file