# Data management using Pandas

**Data management** is a crucial component of statistical analysis and data science work.

This notebook will show you how to import, view, understand, and manage your data using the [Pandas](http://pandas.pydata.org) data processing library. In particular, it demonstrates how to read a dataset into Python and obtain a basic understanding of its content.

Note that **Python** by itself is a general-purpose programming language and does not provide high-level data processing capabilities. The **Pandas** library was developed to meet this need. It is the most popular Python library for data manipulation, and we will use it extensively in this course. It provides high-performance, easy-to-use data structures and data analysis tools.

The main data structure that **Pandas** works with is called a **Data Frame**. This is a two-dimensional table of data in which the rows typically represent cases and the columns represent variables (for example, the dataset used in this tutorial). Pandas also has a one-dimensional data structure called a **Series**, which we will encounter when accessing a single column of a Data Frame.
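
As a minimal sketch of the difference between the two structures, the cell below builds a small Data Frame from made-up values (the column names `height_cm` and `group` are hypothetical and not part of the tutorial dataset) and then selects one column from it.

```python
import pandas as pnd

# Hypothetical toy data: each row is a case, each column is a variable.
toy = pnd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2],
    "group": ["a", "b", "a"],
})

# Selecting one column with single brackets returns a Series.
heights = toy["height_cm"]

# Selecting with a list of column names returns a Data Frame.
heights_frame = toy[["height_cm"]]

print(type(toy).__name__)            # DataFrame
print(type(heights).__name__)        # Series
print(type(heights_frame).__name__)  # DataFrame
print(toy.shape)                     # (3, 2)
```

Single brackets with one name give a Series, while double brackets (a list of names) keep the result two-dimensional.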

Pandas has a variety of functions named `read_xxx` for reading data in different formats. Right now we will focus on reading `csv` files; `csv` stands for comma-separated values. Other supported formats include `excel`, `json`, and `sql`.
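
The following sketch shows two of these readers side by side. To keep it self-contained, it reads from in-memory text through `io.StringIO` instead of a file on disk. The data values and variable names are made up for illustration.

```python
import io
import pandas as pnd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "name,score\nana,9.5\nben,7.0\n"

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
scores = pnd.read_csv(io.StringIO(csv_text))

# The same records expressed as JSON and read with read_json.
json_text = '[{"name": "ana", "score": 9.5}, {"name": "ben", "score": 7.0}]'
scores_json = pnd.read_json(io.StringIO(json_text))

print(scores.shape)       # (2, 2)
print(scores_json.shape)  # (2, 2)
print(scores.dtypes)
```

Both calls return a Data Frame, so everything that follows in the notebook applies regardless of the original file format.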

`read_csv` has many other useful options. For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas. See the [full documentation for `read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for details.
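
To illustrate the effect of `sep`, the sketch below parses the same tab-delimited text twice, once with `sep='\t'` and once with the default separator. The text is a made-up example, not the tutorial dataset.

```python
import io
import pandas as pnd

# Hypothetical tab-delimited text.
tsv_text = "name\tscore\nana\t9.5\nben\t7.0\n"

# With sep="\t" the two fields are split into two columns.
tabbed = pnd.read_csv(io.StringIO(tsv_text), sep="\t")

# With the default sep="," no comma is found, so each line stays in one column.
unsplit = pnd.read_csv(io.StringIO(tsv_text))

print(tabbed.shape)   # (2, 2)
print(unsplit.shape)  # (2, 1)
```

A Data Frame with a single unexpectedly wide column is a common sign that the separator does not match the file.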


## Acknowledgments

- The dataset used in this tutorial comes from the course "Understanding and Visualizing Data with Python" by the University of Michigan, available at https://www.coursera.org/


# Activity: work with the iris dataset

Repeat this tutorial with the iris dataset and answer the following questions.

1. Calculate the statistical summary for each quantitative variable. Explain the results.
    - Identify the name of each column
    - Identify the type of each column
    - Minimum, maximum, mean, median, standard deviation


2. Are there missing data? If so, create a new dataset containing only the rows with non-missing data.


3. Create a new dataset containing only the petal width and length and the type of flower.


4. Create a new dataset containing only the sepal width and length and the type of flower.


5. Create a new dataset containing the sepal width and length and the type of flower encoded as a categorical numerical column.


In [1]:
# Import the packages that we will be using
import pandas as pnd

# Dataset path
path = "datasets/iris/iris.csv"

# Load the dataset
df = pnd.read_csv(path)


In [2]:
df.columns

Index(['SepLen', 'SepWd', 'PtlLen', 'PtlWd', 'Class'], dtype='object')

In [3]:
df.dtypes

SepLen    float64
SepWd     float64
PtlLen    float64
PtlWd     float64
Class      object
dtype: object

In [4]:
df.describe()

           SepLen       SepWd      PtlLen       PtlWd
count  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.057333    3.758000    1.199333
std      0.828066    0.435866    1.765298    0.762238
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.350000    1.300000
75%      6.400000    3.300000    5.100000    1.800000
max      7.900000    4.400000    6.900000    2.500000
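
In the `describe()` output, the `50%` row is the median. If you prefer a summary with an explicitly named median row, one possible approach is to select the numeric columns and pass a list of statistics to `agg`. The sketch below uses a small hypothetical stand-in for the iris table so that it runs on its own.

```python
import pandas as pnd

# Hypothetical stand-in with the same kind of columns as the iris table.
sample = pnd.DataFrame({
    "SepLen": [5.1, 4.9, 6.7, 5.9],
    "PtlLen": [1.4, 1.4, 5.2, 5.1],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Keep only the quantitative (numeric) columns.
numeric = sample.select_dtypes(include="number")

# One row per statistic, one column per variable.
summary = numeric.agg(["min", "max", "mean", "median", "std"])
print(summary)
```

Applied to the notebook's `df`, the same two calls would summarize the four measurement columns and leave out `Class`.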


In [5]:
df.isnull().sum()

SepLen    0
SepWd     0
PtlLen    0
PtlWd     0
Class     0
dtype: int64
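
Every count above is zero, so this dataset has no missing values and no rows need to be removed. For a dataset that does contain gaps, a minimal sketch of the second step of the activity could look like the following. The Data Frame `with_gaps` is hypothetical and built only for illustration.

```python
import numpy as np
import pandas as pnd

# Hypothetical data with two missing measurements.
with_gaps = pnd.DataFrame({
    "SepLen": [5.1, np.nan, 4.7],
    "SepWd": [3.5, 3.0, np.nan],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Count missing values per column.
print(with_gaps.isnull().sum())

# dropna() returns a new Data Frame without any row that has a missing value;
# reset_index(drop=True) renumbers the remaining rows from 0.
complete = with_gaps.dropna().reset_index(drop=True)
print(complete.shape)  # (1, 3)
```

`dropna()` leaves the original Data Frame unchanged, so the full data remain available for comparison.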

In [6]:
dfPetal = df[["PtlLen", "PtlWd", "Class"]]
print(dfPetal)

     PtlLen  PtlWd           Class
0       1.4    0.2     Iris-setosa
1       1.4    0.2     Iris-setosa
2       1.3    0.2     Iris-setosa
3       1.5    0.2     Iris-setosa
4       1.4    0.2     Iris-setosa
..      ...    ...             ...
145     5.2    2.3  Iris-virginica
146     5.0    1.9  Iris-virginica
147     5.2    2.0  Iris-virginica
148     5.4    2.3  Iris-virginica
149     5.1    1.8  Iris-virginica

[150 rows x 3 columns]


In [7]:
dfSepal = df[["SepLen", "SepWd", "Class"]]
print(dfSepal)

     SepLen  SepWd           Class
0       5.1    3.5     Iris-setosa
1       4.9    3.0     Iris-setosa
2       4.7    3.2     Iris-setosa
3       4.6    3.1     Iris-setosa
4       5.0    3.6     Iris-setosa
..      ...    ...             ...
145     6.7    3.0  Iris-virginica
146     6.3    2.5  Iris-virginica
147     6.5    3.0  Iris-virginica
148     6.2    3.4  Iris-virginica
149     5.9    3.0  Iris-virginica

[150 rows x 3 columns]


In [11]:
# Copy the selection so that later edits do not trigger SettingWithCopyWarning
dfSepalCategorical = df[["SepLen", "SepWd", "Class"]].copy()
# The keys must match the labels in df["Class"] exactly (check with df["Class"].unique())
ClassGroup = df.Class.replace({"Iris-setosa": 1, "Iris-versicolour": 2, "Iris-virginica": 3})
dfSepalCategorical = dfSepalCategorical.drop("Class", axis=1)
dfSepalCategorical["ClassGroup"] = ClassGroup
print(dfSepalCategorical)

     SepLen  SepWd ClassGroup
0       5.1    3.5          1
1       4.9    3.0          1
2       4.7    3.2          1
3       4.6    3.1          1
4       5.0    3.6          1
..      ...    ...        ...
145     6.7    3.0          3
146     6.3    2.5          3
147     6.5    3.0          3
148     6.2    3.4          3
149     5.9    3.0          3

[150 rows x 3 columns]
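
An alternative to writing the label-to-number mapping by hand is to let Pandas derive it through the `category` dtype. The sketch below is one possible approach, not the method used above. It runs on a small hypothetical table, and the three class labels in it are assumed spellings that may differ from those in your file.

```python
import pandas as pnd

# Hypothetical rows; the class labels are assumptions for illustration.
flowers = pnd.DataFrame({
    "SepLen": [5.1, 6.3, 7.0],
    "Class": ["Iris-setosa", "Iris-virginica", "Iris-versicolor"],
})

# Work on an explicit copy so the original table is left untouched.
encoded = flowers.copy()

# Convert the labels to a categorical column; categories are sorted alphabetically.
encoded["Class"] = encoded["Class"].astype("category")

# cat.codes numbers the categories from 0, so add 1 to start at 1.
encoded["ClassGroup"] = encoded["Class"].cat.codes + 1

# Show which number was assigned to which label.
print(dict(enumerate(encoded["Class"].cat.categories, start=1)))
print(encoded)
```

Because the codes are derived from the labels actually present in the column, a misspelled key cannot silently leave some rows unencoded.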
