<a href="https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/13_Data_Types.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Types

## Lots of Data

Before we can clean our data, we first need to know what type of data we have and the best tools for working with it!

Let's first look at how we bring our data into the Python environment.  We have worked mostly with `pandas` and will continue to do so!  Pandas is actually built on top of another library, `numpy`.  At some point in our work we will need both!

### Pandas

Pandas is great for loading data.  We have seen it handle CSV, HTML and data from an SQL query.  We can also load JSON and Excel files.
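As a small sketch of the loaders mentioned above, the snippet below reads the same made-up records once as CSV and once as JSON. The text and the names `csv_text`, `json_text`, `from_csv` and `from_json` are assumptions for illustration; they are kept in memory with `io.StringIO` so no files are needed.

```python
import io
import pandas as pa

# Hypothetical data, written inline so the example needs no files.
csv_text = "name,count\napple,3\npear,5\n"
json_text = '[{"name": "apple", "count": 3}, {"name": "pear", "count": 5}]'

# Both readers accept a file path, a URL or a file-like object.
from_csv = pa.read_csv(io.StringIO(csv_text))
from_json = pa.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls should give a two-row table with the columns `name` and `count`.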

`DataFrame` is the table structure we've used before, and a `Series` is similar to a single column of that table.
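To make the relationship concrete, here is a minimal sketch with a made-up three-row table (the name `toy` and its values are assumptions, not part of the iris file). Selecting one column of a `DataFrame` gives back a `Series`.

```python
import pandas as pa

# Hypothetical toy table, for illustration only.
toy = pa.DataFrame({
    "SepalLength": [5.1, 4.9, 4.7],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

column = toy["SepalLength"]    # selecting one column returns a Series

print(type(toy).__name__)      # DataFrame
print(type(column).__name__)   # Series
print(toy.dtypes)              # each column carries its own data type
```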

You should use a pandas `DataFrame` when your data contains categorical data, especially when it sits alongside numeric columns.

Pandas is a good fit when dealing with large tabular datasets.

In [1]:
import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/iris.csv')

df.head()

| | SepalLength | SepalWidth | PedalLength | PedalWidth | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


### Numpy

The other important tool in Python is `numpy`.  It is the foundation of the pandas module, but it has some limitations.

Numpy is excellent for higher-dimensional data, stored as an `array`.  Think of multiple sheets in an Excel spreadsheet: data that will not simply fit in a two-dimensional table.
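The spreadsheet picture can be sketched as a three-dimensional array. The sizes below (2 sheets, 3 rows, 4 columns) and the name `sheets` are made up for illustration.

```python
import numpy as np

# Hypothetical example: 2 "sheets", each with 3 rows and 4 columns,
# filled with the numbers 0 through 23.
sheets = np.arange(24).reshape(2, 3, 4)

print(sheets.ndim)    # number of dimensions
print(sheets.shape)   # (sheets, rows, columns)
print(sheets[1])      # the second sheet, itself a 3-by-4 array
```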

Numpy arrays can be accessed easily by their indices; the same element-by-element access is much less efficient in pandas.
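A short sketch of index-based access, using a few made-up measurements (the names `values` and `frame` are assumptions). The array is indexed directly with `[row, column]`, while the pandas equivalent goes through `.iloc`.

```python
import numpy as np
import pandas as pa

# Hypothetical measurements: 3 rows, 2 columns.
values = np.array([[5.1, 3.5],
                   [4.9, 3.0],
                   [4.7, 3.2]])
frame = pa.DataFrame(values, columns=["SepalLength", "SepalWidth"])

print(values[1, 0])       # row 1, column 0 of the array
print(values[:, 1])       # every row of column 1
print(frame.iloc[1, 0])   # the same element, reached through pandas
```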

Numpy data should be just numbers!  Categorical data should be converted to numbers before it goes into a numpy array.  Numpy is efficient and fast but works best on smaller datasets.


In [2]:
import numpy as np

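# dtype = float keeps the new indicator columns numeric (recent pandas versions default to booleans)
df1 = pa.get_dummies(data = df, dtype = float)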

X = np.array(df1)

X

array([[5.1, 3.5, 1.4, ..., 1. , 0. , 0. ],
       [4.9, 3. , 1.4, ..., 1. , 0. , 0. ],
       [4.7, 3.2, 1.3, ..., 1. , 0. , 0. ],
       ...,
       [6.5, 3. , 5.2, ..., 0. , 0. , 1. ],
       [6.2, 3.4, 5.4, ..., 0. , 0. , 1. ],
       [5.9, 3. , 5.1, ..., 0. , 0. , 1. ]])

We will discuss what `get_dummies` does in due time.  For now, just know that it converted the class into numbers for use in a numpy array.
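As a quick preview, and only as a sketch on a made-up three-row table (the name `toy` and its values are assumptions), `get_dummies` replaces a text column with one indicator column per distinct value:

```python
import pandas as pa

# Hypothetical toy table, for illustration only.
toy = pa.DataFrame({
    "SepalLength": [5.1, 6.4, 6.5],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# dtype=float asks for 1.0/0.0 indicators instead of booleans.
dummies = pa.get_dummies(toy, dtype=float)

print(dummies.columns.tolist())
print(dummies)
```

Each new column is named after the original column and the value it flags, joined by an underscore.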

### Which is Best?

Very often I will use both in a project.  I'll start with pandas for loading, cleaning and basic analysis.  Then I will convert the data to a numpy array and create models for predictions.
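That workflow might look roughly like the sketch below. The data are made up, the cleaning step (dropping rows with a missing value) is just one possible choice, and the column means stand in for the modelling step.

```python
import numpy as np
import pandas as pa

# Hypothetical raw data with one missing measurement.
raw = pa.DataFrame({
    "SepalLength": [5.1, None, 4.7, 6.5],
    "SepalWidth": [3.5, 3.0, 3.2, 3.0],
})

clean = raw.dropna()       # pandas: drop the incomplete row
X = clean.to_numpy()       # numpy: plain numeric array for modelling

print(X.shape)
print(X.mean(axis=0))      # column means, computed by numpy
```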

## Less Data

Now that we have lots of data we'll have to start examining each piece!

### Strings

The most common type of data we examine is a string.  We will spend a lot of time dealing with strings.  Often data in another format is actually given as a string, so we'll have our work cut out for us manipulating strings.

In the `iris` dataset, the class was given as a string.

In [5]:
df.Class.iloc[0]

'Iris-setosa'

The tell-tale sign of a string is the quotes.  Of course we can save a string to a variable too.

In [6]:
a_string = 'My really cool string'

a_string

'My really cool string'
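Here is a small sketch of the kind of string handling mentioned above: a number that arrives as text, and a label that needs to be split. The values and variable names are made up for illustration.

```python
# A number that arrived as text, with stray spaces around it.
raw_value = " 5.1 "
number = float(raw_value.strip())   # remove the spaces, then convert

# A class label like the ones in the iris data.
label = "Iris-setosa"
species = label.split("-")[1]       # keep the part after the hyphen

print(type(raw_value).__name__, type(number).__name__)
print(species)
print(label.upper())
```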

### Boolean

A Boolean is a logical value that takes only one of two values, `True` or `False`.  We can combine Booleans using the usual logical connectives (`and`, `or`, `not`).  We can also get a Boolean by doing comparisons.

In [14]:
a = True
b = False

print(a and b) 

print(a or b)

print( not a ) 

False
True
False


In [9]:
print(3 == 4)

print(5>-2)

False
True


In [24]:
bool(0)

False

This may show up when manipulating data!  You can ask which rows in your dataset have the class *Iris-setosa*.

In [15]:
df.Class == 'Iris-setosa'

0       True
1       True
2       True
3       True
4       True
       ...  
145    False
146    False
147    False
148    False
149    False
Name: Class, Length: 150, dtype: bool

Then you can pass that Boolean series back into the dataframe, and it will give you only the rows where it is `True`.

In [17]:
df[df.Class == 'Iris-setosa'].head(10) #I've added head to limit the output to 10 entries

| | SepalLength | SepalWidth | PedalLength | PedalWidth | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
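Boolean filters can also be combined. The sketch below uses a made-up four-row table (not the real iris file) and the element-wise operators `&` (and), `|` (or) and `~` (not), which pandas uses in place of the keywords `and`, `or` and `not` when working on whole columns.

```python
import pandas as pa

# Hypothetical toy table, for illustration only.
toy = pa.DataFrame({
    "SepalLength": [5.1, 4.9, 6.5, 4.8],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

is_setosa = toy.Class == "Iris-setosa"
is_long = toy.SepalLength > 5.0

both = toy[is_setosa & is_long]          # setosa AND longer than 5.0
either = toy[is_setosa | is_long]        # setosa OR longer than 5.0
neither = toy[~(is_setosa | is_long)]    # NOT (setosa or longer than 5.0)

print(both)
print(either)
print(neither)
```

The parentheses around each combined condition matter, because `&` and `|` bind more tightly than comparisons such as `==` and `>`.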


### Integers

Integers are whole numbers that can be positive, negative or zero.  Integers are closed under addition, subtraction and multiplication (but not division).  Storing whole numbers as integers can save some memory and avoids rounding error, so if your entry is an integer you should store it that way.

Some examples of integers are customer numbers and counts of objects.  The function to convert to an integer is `int`.

In [18]:
int(-3.0000)

-3
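The closure remark can be checked directly in Python. This is a small sketch with arbitrary numbers; the variable names are only for illustration.

```python
total = 7 + 2        # int + int stays an int
product = 7 * 2      # int * int stays an int
quotient = 7 / 2     # true division always returns a float
floored = 7 // 2     # floor division keeps the result an int

truncated = int(-3.7)   # int() drops the fractional part (toward zero)
parsed = int("42")      # int() also converts digit strings

print(type(total).__name__, type(product).__name__)
print(quotient, type(quotient).__name__)
print(floored, truncated, parsed)
```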

### Floats

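A float is a real number stored with a limited number of significant digits (64 bits by default in pandas, which gives roughly 15 to 17 significant decimal digits).  Be wary of the last few decimals, especially after many computations, because rounding errors accumulate.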

In the `iris` dataset most columns are floats.
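A classic illustration of why the last few decimals deserve caution, sketched with plain Python floats. Comparing with a tolerance, here through `math.isclose`, is one common way around the problem.

```python
import math

total = 0.1 + 0.2

print(total)                      # 0.30000000000000004
print(total == 0.3)               # False: the last digit differs
print(math.isclose(total, 0.3))   # True: equal within a small tolerance
```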

### Datetime

1. int
2. float
3. string
4. factor
5. bool
6. datetime
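The last type in the list is not demonstrated above, so here is a minimal sketch under the assumption that dates arrive as text, as they often do. The dates and the names `raw_dates` and `dates` are made up; `to_datetime` is the pandas function that parses them.

```python
import pandas as pa

# Hypothetical dates stored as strings.
raw_dates = pa.Series(["2021-01-15", "2021-02-01", "2021-03-20"])

dates = pa.to_datetime(raw_dates)   # parse the text into datetime values

print(dates.dtype)
print(dates.dt.month.tolist())      # pull out one component of each date
print(dates.max() - dates.min())    # subtracting dates gives a time span
```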