# Data management using Pandas

**Data management** is a crucial component of statistical analysis and data science work.

This notebook will show you how to import, view, understand, and manage your data using the [Pandas](http://pandas.pydata.org) data processing library. That is, it demonstrates how to read a dataset into Python and obtain a basic understanding of its content.

Note that **Python** by itself is a general-purpose programming language and does not provide high-level data processing capabilities.  The **Pandas** library was developed to meet this need. **Pandas** is one of the most widely used Python libraries for data manipulation, and we will use it extensively in this course. It provides high-performance, easy-to-use data structures and data analysis tools.

The main data structure that **Pandas** works with is called a **Data Frame**. This is a two-dimensional table of data in which the rows typically represent cases and the columns represent variables, as in the dataset used in this tutorial.  Pandas also has a one-dimensional data structure called a **Series**, which we will encounter when accessing a single column of a Data Frame.
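
As a minimal sketch of the two structures, the toy data frame below (hypothetical values, not the tutorial's dataset) shows that selecting a single column yields a Series:

```python
import pandas as pd

# Hypothetical toy data: three flowers, one measurement and one label each
toy = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.7],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Selecting one column with single brackets returns a Series
sepal_length = toy["SepalLength"]

print(type(toy).__name__)
print(type(sepal_length).__name__)
```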

Pandas has a variety of functions named `read_xxx` for reading data in different formats.  Right now we will focus on reading `csv` (comma-separated values) files, but other supported formats include `excel`, `json`, and `sql`.

There are many other options to `read_csv` that are very useful.  For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas.  See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for `read_csv`.
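
The `sep` option can be tried without any file on disk. The sketch below assumes a small, made-up tab-delimited text held in memory (`tsv_text` is hypothetical) and reads it with `sep="\t"`:

```python
import io
import pandas as pd

# Hypothetical tab-delimited content standing in for a file on disk
tsv_text = (
    "SepalLength\tSepalWidth\tClass\n"
    "5.1\t3.5\tIris-setosa\n"
    "4.9\t3.0\tIris-setosa\n"
)

# sep="\t" tells read_csv that the fields are separated by tabs
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(tsv_df.shape)
print(list(tsv_df.columns))
```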


## Acknowledgments

- The dataset used in this tutorial comes from the Coursera (https://www.coursera.org/) course "Understanding and Visualizing Data with Python" by the University of Michigan


# Importing libraries


In [2]:
# Import the packages that we will be using
import pandas as pd
import seaborn as sns

# Importing data

In [3]:
# Path to the dataset file
ruta = "datasets/iris/irisFinal.csv"

# Load the dataset
df = pd.read_csv(ruta)

If we want to check the type of the object returned by `read_csv`, we can simply type `type(df)`.

In [4]:
df

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


# Exploring the content of the data set

Use the `shape` attribute to determine the number of rows and columns in a data frame. This can be used to confirm that we have actually obtained the data that we are expecting.

Based on what we see below, the data set being read here has $N_r = 150$ rows, corresponding to 150 observations, and $N_c = 5$ columns, corresponding to 5 variables in this particular data file.

In [5]:
NL, NC = df.shape
print(NC)
print(NL)

5
150
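
As a small illustration on made-up data (the `toy` frame is hypothetical), `shape` is a `(rows, columns)` tuple that is accessed without parentheses, while `len()` reports only the number of rows:

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({
    "SepalLength": [5.1, 4.9, 6.7],
    "PetalLength": [1.4, 1.4, 5.2],
})

n_rows, n_cols = toy.shape  # rows first, then columns
print(n_rows, n_cols)
print(len(toy))             # number of rows only
```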


To show the entire data frame, simply evaluate `df` by itself in a cell, as was done in cell `In [4]` above.

As you can see, we have a two-dimensional object where each row is an independent observation and each column is a variable.

Now, use the `head()` method to show the first 5 rows of our data frame

In [6]:
df.head()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Also, you can use the `tail()` method to show the last 5 rows of our data frame

In [7]:
df.tail()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Class
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica
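
Both `head()` and `tail()` accept the number of rows to return as an optional argument. A short sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with six rows
toy = pd.DataFrame({"PetalLength": [1.4, 1.3, 1.5, 4.5, 5.2, 5.1]})

first_two = toy.head(2)   # first 2 rows instead of the default 5
last_three = toy.tail(3)  # last 3 rows instead of the default 5

print(first_two)
print(last_three)
```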


The columns in a Pandas data frame have names. To see them, and to learn a little more about the data, use the `columns` attribute:

In [8]:
df.columns

Index(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Class'], dtype='object')

Be aware that every variable in a Pandas data frame has a data type.  There are many different data types, but most commonly you will encounter floating point values (real numbers), integers, strings (text), and date/time values.  When Pandas reads a text/csv file, it infers the data type of each column from the values it finds in the file.  Usually it selects an appropriate type, but occasionally it does not.  To confirm that the data types are consistent with what the variables represent, inspect the `dtypes` attribute of the data frame.

In [9]:
df.dtypes

SepalLength    float64
SepalWidth     float64
PetalLength    float64
PetalWidth     float64
Class           object
dtype: object
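
When an inferred type is not the one you expect, the column can be converted after loading. The sketch below is one possible approach, assuming a made-up CSV text in which a stray `"unknown"` entry prevents `PetalWidth` from being read as numbers; `pd.to_numeric` with `errors="coerce"` turns unparseable entries into `NaN`:

```python
import io
import pandas as pd

# Hypothetical CSV text: one PetalWidth entry was typed as "unknown"
csv_text = (
    "PetalWidth,Class\n"
    "0.2,Iris-setosa\n"
    "unknown,Iris-setosa\n"
    "2.3,Iris-virginica\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# The stray text entry keeps Pandas from reading the column as numbers
read_as_numeric = pd.api.types.is_numeric_dtype(raw["PetalWidth"])
print(read_as_numeric)

# Convert to numbers; entries that cannot be parsed become NaN
raw["PetalWidth"] = pd.to_numeric(raw["PetalWidth"], errors="coerce")
print(raw.dtypes)
```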

Summary statistics, which include things like the mean, min, and max of the data, can be useful to get a feel for how large some of the variables are and what variables may be the most important. 

In [10]:
# Summary statistics for the quantitative variables
df.describe()

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


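It is also possible to get statistics on the entire data frame or on a single column, as follows. In recent Pandas versions, methods such as `mean`, `median`, `std`, and `corr` need the argument `numeric_only=True` when the data frame contains a non-numeric column such as `Class`: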

- `df.mean()` Returns the mean of all columns
- `df.corr()` Returns the correlation between columns in a data frame
- `df.count()` Returns the number of non-null values in each data frame column
- `df.max()` Returns the highest value in each column
- `df.min()` Returns the lowest value in each column
- `df.median()` Returns the median of each column
- `df.std()` Returns the standard deviation of each column

In [12]:
df.mean(numeric_only=True)

SepalLength    5.843333
SepalWidth     3.057333
PetalLength    3.758000
PetalWidth     1.199333
dtype: float64
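
The same methods work on a single column, where they return one number instead of a Series. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column
toy = pd.DataFrame({
    "SepalLength": [5.0, 6.0, 7.0],
    "PetalLength": [1.0, 4.0, 7.0],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

col_mean = toy["SepalLength"].mean()     # single column: one number
col_median = toy["PetalLength"].median()
all_means = toy.mean(numeric_only=True)  # numeric columns: one value per column

print(col_mean, col_median)
print(all_means)
```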

# How to write a data frame to a File

To save your data to a file, simply use the `to_csv` method

Examples:
- df.to_csv('myDataFrame.csv')
- df.to_csv('myDataFrame.csv', sep='\t')

In [13]:
df.to_csv('myDataFrameIris.csv')
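
By default `to_csv` also writes the row index as an extra first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below uses hypothetical toy data and in-memory buffers instead of real files to compare the default with `index=False`:

```python
import io
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "SepalLength": [5.1, 4.9],
    "Class": ["Iris-setosa", "Iris-setosa"],
})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```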

# Get unique existing values

List the unique values in one of the columns, for example with `df.Class.unique()`.


In [14]:
# List unique values in the df['Class'] column
df.Class.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
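
Two related Series methods are often useful alongside `unique()`: `nunique()` counts the distinct values and `value_counts()` counts how often each one occurs. A short sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy column of class labels
classes = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-virginica"], name="Class"
)

distinct = list(classes.unique())  # distinct values in order of appearance
n_distinct = classes.nunique()     # how many distinct values there are
counts = classes.value_counts()    # occurrences of each value

print(distinct)
print(n_distinct)
print(counts)
```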

# Data cleaning: handling missing data

Before getting started to work with your data, it's a good practice to observe it thoroughly to identify missing values and handle them accordingly.

When reading a dataset using Pandas, there is a set of values including 'NA', 'NULL', and 'NaN' that are taken by default to represent a missing value.  The full list of default missing value codes is in the `read_csv` documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).  That documentation also explains how to change the way that `read_csv` decides whether a variable's value is missing.
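
As one illustration of changing this behavior, the `na_values` option of `read_csv` adds extra codes to the default list. The sketch below assumes a made-up CSV text in which the word `missing` marks an absent measurement:

```python
import io
import pandas as pd

# Hypothetical CSV text: "NA" is a default missing code, "missing" is not
csv_text = (
    "PetalWidth,Class\n"
    "0.2,Iris-setosa\n"
    "NA,Iris-setosa\n"
    "missing,Iris-virginica\n"
)

default_df = pd.read_csv(io.StringIO(csv_text))
custom_df = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])

# Count the entries treated as missing under each setting
default_missing = int(default_df["PetalWidth"].isnull().sum())
custom_missing = int(custom_df["PetalWidth"].isnull().sum())
print(default_missing, custom_missing)
```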

Pandas has functions called `isnull` and `notnull` that can be used to identify where the missing and non-missing values are located in a data frame.  

Below we use these functions to count the number of missing and non-missing values in each variable of the dataset.

In [18]:
# Look for null values
print(df.isnull().sum())

SepalLength    0
SepalWidth     0
PetalLength    0
PetalWidth     0
Class          0
dtype: int64
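
The iris data has no missing values, so the counts above are all zero. To see what these functions report when values are missing, here is a sketch on hypothetical toy data; it also uses `dropna()`, one common way to keep only the complete rows:

```python
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "PetalWidth": [0.2, float("nan"), 2.3],
    "Class": ["Iris-setosa", "Iris-setosa", None],
})

missing_per_column = toy.isnull().sum()   # missing values per column
present_per_column = toy.notnull().sum()  # non-missing values per column
complete_rows = toy.dropna()              # rows with no missing value at all

print(missing_per_column)
print(present_per_column)
print(complete_rows)
```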

# 3. Create a new dataset containing only the petal width and length and the type of flower

In [28]:
# New dataset
newData = df[["PetalWidth", "PetalLength", "Class"]]
print(newData)

     PetalWidth  PetalLength           Class
0           0.2          1.4     Iris-setosa
1           0.2          1.4     Iris-setosa
2           0.2          1.3     Iris-setosa
3           0.2          1.5     Iris-setosa
4           0.2          1.4     Iris-setosa
..          ...          ...             ...
145         2.3          5.2  Iris-virginica
146         1.9          5.0  Iris-virginica
147         2.0          5.2  Iris-virginica
148         2.3          5.4  Iris-virginica
149         1.8          5.1  Iris-virginica

[150 rows x 3 columns]


# 4. Create a new dataset containing only the sepal width and length and the type of flower

In [29]:
# New dataset
newData2 = df[["SepalWidth", "SepalLength", "Class"]]
print(newData2)

     SepalWidth  SepalLength           Class
0           3.5          5.1     Iris-setosa
1           3.0          4.9     Iris-setosa
2           3.2          4.7     Iris-setosa
3           3.1          4.6     Iris-setosa
4           3.6          5.0     Iris-setosa
..          ...          ...             ...
145         3.0          6.7  Iris-virginica
146         2.5          6.3  Iris-virginica
147         3.0          6.5  Iris-virginica
148         3.4          6.2  Iris-virginica
149         3.0          5.9  Iris-virginica

[150 rows x 3 columns]


# 5. Create a new dataset containing the sepal width and length and the type of flower encoded as a categorical numerical column

In [40]:
# New dataset
newData3 = df[["SepalWidth", "SepalLength", "Class"]]
newData3["ClassGroup"] = df.Class.replace({"Iris-setosa":1, "Iris-versicolor":2, "Iris-virginica":3})
print(newData3)

     SepalWidth  SepalLength           Class  ClassGroup
0           3.5          5.1     Iris-setosa           1
1           3.0          4.9     Iris-setosa           1
2           3.2          4.7     Iris-setosa           1
3           3.1          4.6     Iris-setosa           1
4           3.6          5.0     Iris-setosa           1
..          ...          ...             ...         ...
145         3.0          6.7  Iris-virginica           3
146         2.5          6.3  Iris-virginica           3
147         3.0          6.5  Iris-virginica           3
148         3.4          6.2  Iris-virginica           3
149         3.0          5.9  Iris-virginica           3

[150 rows x 4 columns]


The cell above also emits the following `SettingWithCopyWarning`, because `newData3` was obtained by selecting columns from `df` and a new column was then assigned to it:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newData3["ClassGroup"] = df.Class.replace({"Iris-setosa":1, "Iris-versicolor":2, "Iris-virginica":3})
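
One common way to avoid this warning is to take an explicit copy of the selected columns before adding a new one. The sketch below is an alternative to the cell above, not the original solution: it uses hypothetical toy data, and it swaps `replace` for `Series.map` to translate the class names into codes.

```python
import pandas as pd

# Hypothetical toy data with one flower per class
toy = pd.DataFrame({
    "SepalWidth": [3.5, 3.2, 3.0],
    "SepalLength": [5.1, 7.0, 5.9],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# .copy() makes the subset an independent data frame
subset = toy[["SepalWidth", "SepalLength", "Class"]].copy()

# Map each class name to a numeric code
codes = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
subset["ClassGroup"] = subset["Class"].map(codes)

print(subset)
```

Because `subset` is an independent copy, the original `toy` data frame is left unchanged.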


# Activity: work with the iris dataset

Repeat this tutorial with the iris data set and answer the following questions

1. Calculate the statistical summary for each quantitative variable. Explain the results
    - Identify the name of each column
    SepalLength, SepalWidth, PetalLength, PetalWidth, Class
    
    - Identify the type of each column
    SepalLength    float64
    SepalWidth     float64
    PetalLength    float64
    PetalWidth     float64
    Class           object
    
    - Minimum, maximum, mean, median, standard deviation:
           SepalLength  SepalWidth  PetalLength  PetalWidth
    count   150.000000  150.000000   150.000000  150.000000
    mean      5.843333    3.057333     3.758000    1.199333
    std       0.828066    0.435866     1.765298    0.762238
    min       4.300000    2.000000     1.000000    0.100000
    50%       5.800000    3.000000     4.350000    1.300000
    max       7.900000    4.400000     6.900000    2.500000

2. Are there missing data? If so, create a new dataset containing only the rows with the non-missing data
There are no null or missing values.

3. Create a new dataset containing only the petal width and length and the type of flower
See section 3 above.

4. Create a new dataset containing only the sepal width and length and the type of flower
See section 4 above.

5. Create a new dataset containing the sepal width and length and the type of flower encoded as a categorical numerical column
See section 5 above; the code runs, but it emits a warning.