### Load and Verify Data

#### Load the Seeds Data Set from a URL

You will begin the Case Study by loading the data directly from a URL. The data is hosted in a whitespace-separated file on the UCI Machine Learning Repository<sup>data</sup>.

data: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt 


#### Programming Method: Importing a Python Module as Alias

The first Python code you will run in this Case Study imports the libraries necessary for the data science work performed in this Case Study. Each of the `import` statements you will run has the following format:

<include type="listing" label="import as alias">

###### Importing a Python Module as Alias

```
import full_library_name as short_library_name
```

</include>

This has the effect of importing the entire library to be used under an alias. For example, you will import `numpy` using the import statement

<include type="listing" label="import_numpy_with_alias">

###### Importing `numpy` with an alias

```
import numpy as np
```

</include>

This imports the entire `numpy` library and makes it available under the shortened alias `np`; the full name `numpy` is not bound by this statement. This is a common technique in Python programming and is useful when you will be making frequent reference to the library.
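As a minimal sketch of what aliasing does, the following uses the standard-library `math` module as a stand-in for `numpy` (an assumption made so that the snippet runs without third-party packages). It suggests that the alias is simply another name for the same module object.

```python
import sys
import math as m  # bind the standard-library math module to the alias `m`

# Python caches imported modules in sys.modules under their full names.
# The alias `m` points at that same cached module object.
alias_is_same_module = m is sys.modules["math"]

# Functions are reached through the alias exactly as through the full name.
root = m.sqrt(16.0)

print(alias_is_same_module, root)
```

The same reasoning applies to `np`, `pd`, `plt`, and `sns`: each is a short name for the full module.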

##### Import the Python Numerical Stack

The entire Case Study is performed using the Python Numerical Stack. You first load the libraries that comprise the Python numerical stack, including

- `matplotlib`
- `numpy`
- `pandas`
- `seaborn`. 

In [1]:
###### Import the Python Numerical Stack

import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd 
import seaborn as sns

##### Run the IPython Magic Command `%matplotlib inline`

You make use of the IPython Magic<sup>magics</sup> command `%matplotlib` to configure the notebook for working with `matplotlib`. The `inline` argument specifies that figures should be displayed inline within the Jupyter Notebook. Note that, as with other IPython Magic commands, you begin this line of code with the `%` symbol. The result is that any images generated by your Python code will be rendered within the notebook while you work.

magics: https://ipython.readthedocs.io/en/stable/interactive/magics.html


In [2]:
###### Render images inline with IPython magic

%matplotlib inline

##### Define the Dataset URL

Next, you define a variable to hold the URL of the data set as a string. Note that you give the variable an all-caps name. While this has no semantic meaning in Python, you do this to signify that this string should be thought of as a constant, that is, a value stored by the notebook that will never change.

In [3]:
###### Define the Dataset URL

SEEDS_URL = ('https://archive.ics.uci.edu/' +
             'ml/machine-learning-databases/00236/seeds_dataset.txt')

You display the string stored as `SEEDS_URL` to verify that it matches the expected value. 

In [4]:
###### Display the Dataset URL

SEEDS_URL

'https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'
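Beyond reading the displayed string by eye, the URL can also be checked programmatically. The sketch below is one possible approach, not part of the original notebook: it rebuilds `SEEDS_URL` using adjacent string literals (an alternative to `+`) and inspects its parts with the standard-library `urllib.parse.urlparse`. The names `parts` and `file_name` are introduced here only for illustration.

```python
from urllib.parse import urlparse

# Adjacent string literals inside parentheses are joined by Python itself,
# so this is an alternative spelling of the `+` concatenation.
SEEDS_URL = ('https://archive.ics.uci.edu/'
             'ml/machine-learning-databases/00236/seeds_dataset.txt')

# Split the URL into components without making any network request.
parts = urlparse(SEEDS_URL)

# The last path segment is the name of the hosted text file.
file_name = parts.path.rsplit('/', 1)[-1]

print(parts.scheme, parts.netloc, file_name)
```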

##### Display a Sample of the Data Set

It is worth entering the `SEEDS_URL` into a browser, both to verify that the URL is correct and to get a preliminary sense of what the data should look like. Copy the value stored as `SEEDS_URL` and paste it into the address bar of any browser. If the URL is correct, you will see the data stored at the UCI Machine Learning Repository in your browser window as plain text. The first five rows of the data are as follows:

<include type="listing" label="sample_seeds_dataset">

###### A Sample of the Seeds dataset text file

```
15.26    14.84    0.871     5.763    3.312    2.221    5.22     1
14.88    14.57    0.8811    5.554    3.333    1.018    4.956    1
14.29    14.09    0.905     5.291    3.337    2.699    4.825    1
13.84    13.94    0.8955    5.324    3.379    2.259    4.805    1
16.14    14.99    0.9034    5.658    3.562    1.355    5.175    1
...
```

</include>

##### Use `pd.read_csv()` to load data as `DataFrame`

Next, use the `pd.read_csv()` function available as part of the Pandas library to read the text file stored at that URL into a `DataFrame`.

In [5]:
###### Use `pd.read_csv()` to load data as `DataFrame`

seeds_df = pd.read_csv(SEEDS_URL, header=None, sep=r'\s+')

A bit of special handling is required here. First, it is necessary to specify on load that the file has no header row. This is done using the argument `header=None`. Second, it is necessary to specify that this data is whitespace-separated. In other words, whitespace is used to separate the values in a row of data rather than the more conventional commas. This is done by passing the regular expression `\s+` as the `sep` argument. This regular expression signifies that one or more whitespace characters should be treated as a single separator.
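The effect of these two arguments can be tried without a network connection. The following is a minimal sketch that assumes a hypothetical in-memory stand-in, `SAMPLE_TEXT`, built from the first three sample rows with a deliberately uneven mix of tabs and spaces; it is not the actual file layout.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the remote file: the first three
# sample rows, separated by an uneven mix of spaces and tabs.
SAMPLE_TEXT = (
    "15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n"
    "14.88  14.57  0.8811  5.554  3.333  1.018  4.956  1\n"
    "14.29\t14.09\t0.905\t\t5.291\t3.337\t2.699\t4.825\t1\n"
)

# header=None keeps the first row as data; sep=r'\s+' treats any run of
# whitespace as one separator.
sample_df = pd.read_csv(io.StringIO(SAMPLE_TEXT), header=None, sep=r'\s+')

print(sample_df.shape)
print(list(sample_df.columns))
```

With `header=None`, Pandas labels the columns with integers starting at 0, which is why the names are assigned manually in the next step.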

##### Manually Define Column Names

Because no header row is specified in the text file, it is necessary to manually specify the names of each column in your new `DataFrame`. These names were obtained from the attribute information section at the UCI Machine Learning Repository. Note that you make all of the names computer-friendly by replacing whitespace with underscores.

The names of the columns of a `DataFrame` are stored as the attribute `df.columns`. You specify these names manually by assigning a list of the correct length to this attribute.

In [6]:
###### Manually Define Column Names

seeds_df.columns = [
    "area",
    "perimeter",
    "compactness",
    "length_of_kernel",
    "width_of_kernel",
    "asymmetry_coefficient",
    "length_of_kernel_groove",
    "seed_class"
]
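The requirement that the list be "of the correct length" can be seen in a small sketch. It assumes a hypothetical two-column frame, `demo_df`, standing in for `seeds_df`, and shows that Pandas raises a `ValueError` when the number of names does not match the number of columns.

```python
import pandas as pd

# Hypothetical two-column frame standing in for `seeds_df`.
demo_df = pd.DataFrame([[15.26, 1], [14.88, 1]])

# A list of the correct length is accepted.
demo_df.columns = ["area", "seed_class"]

# A list of the wrong length is rejected and the old names are kept.
try:
    demo_df.columns = ["area", "perimeter", "seed_class"]
    length_mismatch_raised = False
except ValueError:
    length_mismatch_raised = True

print(list(demo_df.columns), length_mismatch_raised)
```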

##### Basic Data Verification

As a means of verifying that the data is properly loaded, you display the type and the shape of the `DataFrame`, along with its first five rows. This is done using the `type()` function, the `.shape` attribute, and the `.head()` method, respectively. Note that `.shape` is an attribute and `.head()` is a method of the `pandas.DataFrame` class, while `type()` is a Python built-in function.

In [7]:
###### Show the type of `seeds_df`

type(seeds_df)

pandas.core.frame.DataFrame

In [8]:
###### Show the shape of `seeds_df`

seeds_df.shape

(210, 8)

As expected, `seeds_df` is a Pandas `DataFrame`. It has a shape of `(210, 8)`. This corresponds to $n = 210$ samples and $p = 7$ features, plus 1 target column.

In [9]:
###### Show the `.head()` for a few columns of `seeds_df`

seeds_df[
    ['area', 'perimeter', 
     'compactness', 
     'length_of_kernel']
].head()

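|   | area | perimeter | compactness | length_of_kernel |
|---|------|-----------|-------------|------------------|
| 0 | 15.26 | 14.84 | 0.871 | 5.763 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 |
| 2 | 14.29 | 14.09 | 0.905 | 5.291 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 |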


After running `seeds_df.head()`, make sure to compare the first five rows of the loaded `DataFrame` to the first five rows of the plain text file in your browser.

#### Verify Type Casting

During the loading process, Pandas infers the most appropriate data type for each column and casts the column to that type<sup>infertype</sup>. Based on information available at the UCI Machine Learning Repository page for this data set, the first seven columns are expected to be floating-point, that is, type `float`, and the eighth column, the target `seed_class`, is expected to be categorical.

infertype: https://rushter.com/blog/pandas-data-type-inference/


##### Display `DataFrame.dtypes`

Next, you use the `.dtypes` attribute to display the data types of your `seeds_df` `DataFrame`.

In [10]:
###### Display `DataFrame.dtypes`

seeds_df.dtypes

area                       float64
perimeter                  float64
compactness                float64
length_of_kernel           float64
width_of_kernel            float64
asymmetry_coefficient      float64
length_of_kernel_groove    float64
seed_class                   int64
dtype: object

Note that the first seven columns have been correctly cast, but `seed_class` has been designated as type `int64` rather than as a categorical type. This is no doubt due to the fact that the `seed_class` categories are encoded numerically.
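This inference can be reproduced on a small scale. The sketch below assumes a hypothetical two-column text sample, `SAMPLE_TEXT`, and uses the `names` parameter of `pd.read_csv()` to label the columns on load; it suggests that a column of whole numbers is read as `int64` even when the numbers are meant as category codes.

```python
import io

import pandas as pd

# Hypothetical two-column sample: a measurement and a numeric class code.
SAMPLE_TEXT = "15.26 1\n14.88 2\n14.29 3\n"

demo_df = pd.read_csv(
    io.StringIO(SAMPLE_TEXT),
    header=None,
    sep=r'\s+',
    names=["area", "seed_class"],
)

# Values with a decimal point become float64; whole numbers become int64.
area_dtype = str(demo_df["area"].dtype)
seed_class_dtype = str(demo_df["seed_class"].dtype)

print(area_dtype, seed_class_dtype)
```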

##### Display Unique Values for the `seed_class` Column

Using the `.unique()` method, you display the unique values of the `seed_class` column. 

In [11]:
###### Display Unique Values for the `seed_class` Column

seeds_df.seed_class.unique()

array([1, 2, 3])

Indeed, the seed classes, Kama, Rosa, and Canadian, have been encoded as the numbers 1, 2, and 3. As all of the values are integers, Pandas inferred that the column should be of type `int64`.

##### Programming Method: Transform and Reassign a Variable

The programming pattern used to correct this is a fairly common one. You take some variable, transform that variable, and then reassign the result to the original variable name. In the simplest sense, you might think about doing this with an integer:

<include type="listing" label="transform_and_reassign">

###### Transform and Reassign `my_int`

```
my_int = 4
my_int = my_int + 5
```

</include>

Of course, this pattern is so common that it has a shortcut that works in most popular programming languages:

<include type="listing" label="simple_incrementation">

###### Simple Incrementation of `my_int`

```
my_int = 4
my_int += 5
```

</include>

In general, where it is not necessary to keep the intermediate values of a variable, you can use a pattern like this one. For example, you might imagine recording the time that a certain process takes using the `time.time()` function.

<include type="listing" label="time_a_process">

###### Time a Process

```
import time
start = time.time()
## execute some process
elapsed_time = time.time() - start
```

</include>

Note that `elapsed_time` is a number of seconds stored as a `float`, while you may only be interested in the number of seconds as an `int`. A wasteful method would be:

<include type="listing" label="cast_as_int">

###### Cast `elapsed_time` as an integer

```
elapsed_time_int = int(elapsed_time)
```

</include>

Using this, you have actually stored two variables, `elapsed_time` and `elapsed_time_int`. When transforming and reassigning, you simply overwrite the first variable with the change you wish to make:

<include type="listing" label="cast_and_reassign">

###### Cast `elapsed_time` as an integer and reassign to `elapsed_time`

```
elapsed_time = int(elapsed_time)
```

</include>

Here you have made use of the transformation and reassignment programming pattern. You cast `elapsed_time` as an `int` and reassign this to the original `elapsed_time`, thus only ever storing a single value.
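Put together, the timing example might look like the following runnable sketch. The summation is only a stand-in for "some process", and `type_before` and `type_after` are names introduced here to record the change of type.

```python
import time

start = time.time()
total = sum(range(100_000))  # stand-in for "some process"
elapsed_time = time.time() - start

# Before the cast, the elapsed time is a float number of seconds.
type_before = type(elapsed_time).__name__

# Transform and reassign: int() drops the fractional part, and the result
# replaces the float under the same variable name.
elapsed_time = int(elapsed_time)
type_after = type(elapsed_time).__name__

print(type_before, type_after)
```

Note that `int()` truncates rather than rounds, so a process that takes less than one second is recorded as `0`.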

$\square$ **Note**: This is not always seen as a best practice, and in many cases it makes more sense not to transform your variables after you have assigned them. Even so, the practice is so common in imperative (as opposed to functional) programming that it is worth familiarizing yourself with it, even if you do not intend to handle your variables in this way.

#### Fix Typecasting Error

You will need to correct the typecasting manually. You will do this using the `.astype()` method. This method, applied to a column, returns a copy of that column cast into the requested data type. To complete the casting, you save this typecast copy back as the original column, using the transformation and reassignment programming pattern. This pattern is:

<include type="listing" label="recast_pandas_series_type">

###### Recast the type of a Pandas Series

```
df.my_column = df.my_column.astype(<REQUESTEDTYPE>)
``` 

</include>

##### Cast `seed_class` as a Pandas Column of type `category`

Here, you cast `seed_class` as the Pandas data type `category`<sup>categorical</sup>.

categorical: https://pandas.pydata.org/pandas-docs/stable/categorical.html

In [12]:
###### Cast `seed_class` as a Pandas Column of type `category`

seeds_df.seed_class = seeds_df.seed_class.astype('category')
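To illustrate why the reassignment is needed, the sketch below uses a hypothetical one-column frame, `demo_df`. It suggests that calling `.astype()` on its own leaves the original column untouched, and that the column only changes type once the copy is assigned back.

```python
import pandas as pd

# Hypothetical one-column frame standing in for `seeds_df`.
demo_df = pd.DataFrame({"seed_class": [1, 2, 3, 1]})

# .astype() returns a converted copy; the frame itself is not modified.
converted = demo_df["seed_class"].astype("category")
dtype_before_reassign = str(demo_df["seed_class"].dtype)

# Assigning the copy back completes the transform-and-reassign pattern.
demo_df["seed_class"] = converted
dtype_after_reassign = str(demo_df["seed_class"].dtype)

print(dtype_before_reassign, dtype_after_reassign)
```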

Once more, you display the data types of your `seeds_df` `DataFrame`.

In [13]:
###### Display `DataFrame.dtypes`

seeds_df.dtypes

area                        float64
perimeter                   float64
compactness                 float64
length_of_kernel            float64
width_of_kernel             float64
asymmetry_coefficient       float64
length_of_kernel_groove     float64
seed_class                 category
dtype: object

You now see that the target column, `seed_class`, is correctly encoded as a `category`.
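A categorical column exposes its categories through the `.cat` accessor. As an optional sketch that is not part of the original notebook, the following uses a hypothetical `Series` to list the categories and to attach the variety names to the numeric codes, assuming the mapping 1 = Kama, 2 = Rosa, 3 = Canadian described earlier.

```python
import pandas as pd

# Hypothetical stand-in for the `seed_class` column.
seed_class = pd.Series([1, 2, 3, 1]).astype("category")

# The distinct categories, in sorted order.
categories = list(seed_class.cat.categories)

# rename_categories returns a new Series; the original is unchanged.
named = seed_class.cat.rename_categories({1: "Kama", 2: "Rosa", 3: "Canadian"})

print(categories)
print(list(named))
```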

#### Save Data Set as Local Files

So that you do not need to retrieve your data set from the remote URL each time you wish to work with it, you next use the Pandas `DataFrame` methods `.to_csv()` and `.to_pickle()` to store local versions of the downloaded data. Each of these takes a string as its argument. This string will be used as the name of the file to be saved locally.

##### Programming Method: The Python Pickle Format

While you are no doubt familiar with the CSV format, this may be the first time you are learning about the Pickle format<sup>pickle</sup>. The Pickle format is a Python-specific way to save information. When you write a `DataFrame` to disk as a Pickle, you can load it later *exactly* as it was when you saved it. Furthermore, just as Pandas has a `pd.read_csv()` function, it also includes the function `pd.read_pickle()` to help you load a Pickled `DataFrame` at some later point.

pickle: https://docs.python.org/3/library/pickle.html


In [14]:
###### Save the `DataFrame` as a CSV

seeds_df.to_csv('seeds.csv')

In [15]:
###### Save the `DataFrame` as a pickle

seeds_df.to_pickle('seeds.p')
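The difference between the two formats shows up when the files are read back. The sketch below assumes a hypothetical small frame, `demo_df`, and writes to a temporary directory rather than to `seeds.csv` and `seeds.p`. It suggests that the CSV round trip loses the `category` type, while the Pickle round trip restores the frame as it was saved.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical small frame with one categorical column.
demo_df = pd.DataFrame({"area": [15.26, 14.88, 14.29], "seed_class": [1, 2, 3]})
demo_df["seed_class"] = demo_df["seed_class"].astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"
    pickle_path = Path(tmp_dir) / "demo.p"

    demo_df.to_csv(csv_path)
    demo_df.to_pickle(pickle_path)

    # to_csv() also writes the index as the first column, so it is read
    # back as the index here.
    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pickle = pd.read_pickle(pickle_path)

csv_dtype = str(from_csv["seed_class"].dtype)
pickle_dtype = str(from_pickle["seed_class"].dtype)

print(csv_dtype, pickle_dtype)
```

A frame loaded from the CSV file would therefore need `seed_class` cast to `category` again, whereas the Pickled frame would not.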