## Load and Verify Data

We will load the data directly from a URL. The data is hosted as a whitespace-separated file on the UCI Machine Learning Repository, at the following URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. We define a variable to hold this URL as a string. Note that I have named this variable using all caps. This has no semantic meaning in Python; by convention, it signals that the string should be thought of as a constant, a value stored by the notebook that will never change.

### Load the Seeds Data Set from a URL

#### Import the Python Numerical Stack

We begin by loading the libraries that comprise the Python numerical stack, including:

- `matplotlib`
- `numpy`
- `pandas`
- `seaborn`

Note that we have included the IPython magic command `%matplotlib inline`. This ensures that any images generated are rendered within the notebook while we work.

#### Import Entire Libraries Versus Specific Classes or Modules

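Here we load entire libraries, each under its conventional alias. With `sklearn`, by contrast, it is more usual to import specific classes or modules ad hoc, as they are needed.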

In [None]:
import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import seaborn as sns

%matplotlib inline

#### Define the Data Set URL

In [None]:
UCI_BASE_URL = 'https://archive.ics.uci.edu/'
ML_REPO_URI = 'ml/machine-learning-databases/00236/seeds_dataset.txt'
SEEDS_URL = UCI_BASE_URL + ML_REPO_URI

In [None]:
SEEDS_URL

#### Display a Sample of the Data Set

The following sample was obtained by entering the URL into a browser:
```
15.26	14.84	0.871	5.763	3.312	2.221	5.22	1
14.88	14.57	0.8811	5.554	3.333	1.018	4.956	1
14.29	14.09	0.905	5.291	3.337	2.699	4.825	1
13.84	13.94	0.8955	5.324	3.379	2.259	4.805	1
16.14	14.99	0.9034	5.658	3.562	1.355	5.175	1
14.38	14.21	0.8951	5.386	3.312	2.462	4.956	1
14.69	14.49	0.8799	5.563	3.259	3.586	5.219	1
14.11	14.1	0.8911	5.42	3.302	2.7		5		1
16.63	15.46	0.8747	6.053	3.465	2.04	5.877	1
16.44	15.25	0.888	5.884	3.505	1.969	5.533	1
15.26	14.85	0.8696	5.714	3.242	4.543	5.314	1
14.03	14.16	0.8796	5.438	3.201	1.717	5.001	1
...
```

#### Use `pd.read_csv()` to load data as `DataFrame`

Next, I use the `pd.read_csv()` function from the Pandas library to read the file into a `DataFrame`. A bit of special handling is required here. First, it is necessary to specify on load that the file has no header row. This is done using the argument `header=None`. Second, it is necessary to specify that the data is whitespace-separated. In other words, whitespace, rather than the more conventional comma, separates the values in a row of data. This is done by passing the regular expression (regex) `\s+` as the `sep` argument. This pattern matches one or more whitespace characters, so a run of spaces or tabs is treated as a single separator.

You can read more about regex here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions. 

In [None]:
# A raw string (r'...') keeps the backslash in \s+ from being read as a Python escape.
seeds_df = pd.read_csv(SEEDS_URL, header=None, sep=r'\s+')
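
To see what `header=None` and the `\s+` separator do without touching the network, the following minimal sketch reads a few values from an in-memory string instead of the URL. The string `sample` and the name `toy_df` are made up for illustration; the second row deliberately contains a doubled tab, similar to the irregular spacing visible in the sample above.

```python
import io

import pandas as pd

# Two made-up rows; the second has a doubled tab between values.
sample = "15.26\t14.84\t0.871\t1\n14.11\t14.1\t0.8911\t\t1\n"

# header=None: the first line is data, not column names.
# sep=r"\s+": any run of whitespace counts as one separator.
toy_df = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s+")

print(toy_df.shape)
print(list(toy_df.columns))  # default integer column labels
```

With a plain `sep='\t'`, the doubled tab in the second row would be read as an empty field, which is why the regex separator is the safer choice for this file.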

#### Manually Define Column Names

Because the file has no header row, we must manually specify the name of each column in our new `DataFrame`. These names were obtained from the attribute information section at the UCI Machine Learning Repository. Note that we have made all of the names computer-friendly by replacing whitespace with underscores.

The names of the columns of a `DataFrame` are stored as the attribute `df.columns`. We can specify these names manually by assigning a list of the correct length to this attribute.

In [None]:
seeds_df.columns = [
    "area",
    "perimeter",
    "compactness",
    "length_of_kernel",
    "width_of_kernel",
    "asymmetry_coefficient",
    "length_of_kernel_groove",
    "seed_class",
]
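
The requirement that the list be "of the correct length" can be checked on a toy example. This is a minimal sketch on a made-up two-column `DataFrame` (the names `toy_df` and `length_error` are illustrative only): a list of the right length replaces the default integer labels, while a list of the wrong length raises a `ValueError`.

```python
import pandas as pd

# A made-up frame with two unnamed columns.
toy_df = pd.DataFrame([[15.26, 1], [14.88, 1]])
print(list(toy_df.columns))  # default integer labels

# A list of the correct length replaces the labels.
toy_df.columns = ["area", "seed_class"]

# A list of the wrong length is rejected.
try:
    toy_df.columns = ["area", "perimeter", "seed_class"]
    length_error = None
except ValueError as err:
    length_error = err

print(list(toy_df.columns))
print(length_error)
```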

#### Basic Data Verification

As a means of verifying that the data was properly loaded, we display the type and the shape of the `DataFrame`, along with its first five rows. This is done using the `type()` function, the `.shape` attribute, and the `.head()` method respectively. It is worth noting that `.shape` is an attribute and `.head()` is a method of the `pandas.DataFrame` class, while `type()` is a Python built-in function.

As expected, `seeds_df` is a Pandas `DataFrame`. It has a shape of `(210, 8)`. This corresponds to $n = 210$ observations and $p = 7$ features, plus 1 target column. Looking at the first five rows of the `DataFrame`, it appears that everything has been loaded correctly.

In [None]:
type(seeds_df)

In [None]:
seeds_df.shape

In [None]:
seeds_df.head()

Compare these rows to the sample of the raw file we looked at above.
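
The same checks can be collected into a small helper so that they can be rerun whenever the data is reloaded. This is only a sketch: `check_frame` is a hypothetical helper, not part of Pandas, and the missing-value check is an addition that the notebook does not perform. It is shown on a made-up frame; for the seeds data the call would be `check_frame(seeds_df, (210, 8))`.

```python
import pandas as pd


def check_frame(df, expected_shape):
    """Return a list of problems found; an empty list means all checks passed."""
    if not isinstance(df, pd.DataFrame):
        return [f"expected a DataFrame, got {type(df).__name__}"]
    problems = []
    if df.shape != expected_shape:
        problems.append(f"expected shape {expected_shape}, got {df.shape}")
    n_missing = int(df.isna().sum().sum())
    if n_missing > 0:
        problems.append(f"found {n_missing} missing value(s)")
    return problems


# A made-up frame standing in for the loaded data.
toy_df = pd.DataFrame({"area": [15.26, 14.88], "seed_class": [1, 1]})

print(check_frame(toy_df, (2, 2)))    # no problems expected
print(check_frame(toy_df, (210, 8)))  # reports the shape mismatch
```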

### Verify Type Casting

How pandas does type casting: https://rushter.com/blog/pandas-data-type-inference/

display the defaults of `pd.read_csv()`

show that it infers data type

During the loading process, Pandas infers the most appropriate data type for each column and casts the column to that type. Based on information available at the UCI Machine Learning Repository page for this data set, we would expect the first seven columns to be floating-point, that is, of type `float64`, and the eighth column, the target `seed_class`, to be categorical. Here we use the `.dtypes` attribute to display the data types of our `seeds_df` `DataFrame`.

We note that the first seven columns have been correctly cast, but `seed_class` has been designated as an integer type, `int64`. This is no doubt because the `seed_class` categories are encoded numerically.

#### Display `DataFrame.dtypes`

In [None]:
seeds_df.dtypes
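
The inference itself can be observed on a small scale. The sketch below uses three made-up rows held in a string (the names `raw` and `toy_df` are illustrative): columns containing decimal points come back as `float64`, while a column of whole numbers comes back as `int64`, even if those numbers are really class labels.

```python
import io

import pandas as pd

# Made-up rows: two measurement columns and one numerically encoded label.
raw = "15.26 0.871 1\n14.88 0.8811 2\n14.29 0.905 3\n"

toy_df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")

# Pandas has no way of knowing that the last column is a label.
print(toy_df.dtypes)
```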

#### Display Unique Values for the `seed_class` Column

Using the `.unique()` method, we display the unique values of the `seed_class` column. Indeed, the seed classes, Kama, Rosa, and Canadian, have been encoded as the numbers 1, 2, and 3. As all of the values are integers, Pandas inferred that the column should be of an integer type.

In [None]:
seeds_df.seed_class.unique()
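
As a small illustration on made-up labels, `.unique()` lists each distinct value once, in order of first appearance. The `.value_counts()` call is an addition not used in this notebook; it shows how many rows carry each label, which can be a useful companion check.

```python
import pandas as pd

# Made-up labels standing in for the seed_class column.
labels = pd.Series([1, 1, 2, 3, 3, 3], name="seed_class")

unique_values = labels.unique()               # distinct values only
counts = labels.value_counts().sort_index()   # rows per label, sorted by label

print(unique_values)
print(counts)
```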

#### Discussion: Incorrect Typecasting

We will need to correct this manually. We do so using the `.astype()` method. This method, applied to a column, returns a copy of that column cast into the requested data type. To complete the casting, we save this type-cast copy back as the original column.

The pattern looks like this:

    df.my_column = df.my_column.astype(#REQUESTEDTYPE#)
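
A concrete instance of this pattern, sketched on a made-up one-column frame with `float` as the requested type, shows why the reassignment is needed: calling `.astype()` alone leaves the original column untouched.

```python
import pandas as pd

toy_df = pd.DataFrame({"my_column": [1, 2, 3]})

# .astype() returns a converted copy; the frame itself is unchanged.
converted = toy_df.my_column.astype(float)
dtype_before = toy_df.my_column.dtype

# Saving the copy back under the same name completes the cast.
toy_df.my_column = toy_df.my_column.astype(float)
dtype_after = toy_df.my_column.dtype

print(dtype_before, dtype_after)
```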
    

#### Programming Method: Transform and Reassign a Variable

The programming pattern being used here is a fairly common one. We take some variable, transform it, and then reassign the result to the original variable name. In the simplest sense, you might think about doing this with an integer:

    my_int = 4
    my_int = my_int + 5
    
Of course, this pattern is so common that it has a shortcut that works in most popular programming languages:

    my_int = 4
    my_int += 5
    
In general, where it is not necessary to keep the intermediate values of a variable, we use a pattern like this one. You might imagine doing the same in order to transform a Python `list` into a `numpy` array:

    xx = [1,2,3]
    xx = np.array(xx)
    
Here we have made use of the transformation and reassignment programming pattern. We cast the list `xx` as a `np.array` and reassign the result to the original name, `xx`.
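
The effect of the reassignment can be confirmed by inspecting the type of `xx` before and after. In this short sketch, the names `type_before`, `type_after`, `repeated`, and `doubled` are illustrative only; the last two show one practical consequence of the change of type.

```python
import numpy as np

xx = [1, 2, 3]
type_before = type(xx).__name__
repeated = xx * 2           # on a list, * repeats the elements

xx = np.array(xx)           # transform and reassign to the same name
type_after = type(xx).__name__
doubled = xx * 2            # on an array, * multiplies element by element

print(type_before, repeated)
print(type_after, doubled)
```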

### Fix Typecasting Error

#### Cast `seed_class` as a Pandas Column of type `category`

Here, we cast `seed_class` as the Pandas data type `category`. You can read more about this data type here: https://pandas.pydata.org/pandas-docs/stable/categorical.html.

In [None]:
seeds_df.seed_class = seeds_df.seed_class.astype('category')

Once more, we display the data types of our `seeds_df` `DataFrame`. We see that the target column, `seed_class`, is now correctly encoded as categorical.

In [None]:
seeds_df.dtypes
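
The sketch below repeats the cast on a made-up column and looks at the result through the `.cat` accessor. The final step, replacing the numeric codes with the variety names via `.cat.rename_categories()`, is an optional extension that this notebook does not perform; it assumes the mapping 1 = Kama, 2 = Rosa, 3 = Canadian described above.

```python
import pandas as pd

# Made-up labels standing in for the seed_class column.
seed_class = pd.Series([1, 1, 2, 3], name="seed_class").astype("category")

print(seed_class.dtype)
print(list(seed_class.cat.categories))  # the distinct categories

# Optional: swap the numeric codes for readable names (assumed mapping).
named = seed_class.cat.rename_categories({1: "Kama", 2: "Rosa", 3: "Canadian"})
print(list(named))
```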

### Save Data Set as Local Files

So that we do not need to retrieve our data set from the remote URL each time we work with it, we use the Pandas `DataFrame` methods `.to_csv()` and `.to_pickle()` to store local versions of the downloaded data. Each of these takes a string as its first argument. This string is used as the name of the file to be saved locally.

#### Programming Method: The Python Pickle Format

While you are no doubt familiar with the CSV format, this may be the first time you are learning about the Pickle format. You can read more about this format here: https://docs.python.org/3/library/pickle.html. The Pickle format is a Python-specific way to save information. When you write a `DataFrame` to disk as a Pickle, you can load it later *exactly* as it was when you saved it. Furthermore, just as Pandas has the `pd.read_csv()` function, it also includes the function `pd.read_pickle()` to help you load a pickled `DataFrame` at some later point.

In [None]:
seeds_df.to_csv('seeds.csv')

In [None]:
seeds_df.to_pickle('seeds.p')
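
The practical difference between the two formats shows up when the files are read back. The following sketch saves a made-up frame to a temporary directory and reloads it both ways; the file names and the frame are illustrative only. Note that it passes `index=False` to `.to_csv()`, which is a choice made for this sketch: the call above keeps the default and so also writes the row index as an extra first column.

```python
import os
import tempfile

import pandas as pd

# A made-up frame with a categorical target column.
toy_df = pd.DataFrame({"area": [15.26, 14.88, 13.84], "seed_class": [1, 2, 3]})
toy_df["seed_class"] = toy_df["seed_class"].astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    pickle_path = os.path.join(tmp_dir, "toy.p")

    toy_df.to_csv(csv_path, index=False)  # plain text: dtypes are not stored
    toy_df.to_pickle(pickle_path)         # Python-specific: dtypes are stored

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pickle_path)

print(from_csv.dtypes)     # seed_class is inferred again from the text
print(from_pickle.dtypes)  # seed_class is still categorical
```

The CSV round trip goes through type inference again, so the categorical cast would have to be repeated after loading, whereas the pickled frame comes back with its data types intact.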