# 2. Introduction to Pandas (LAYG - Learn-As-You-Go)

References: 
* https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html
* https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
* https://pandas.pydata.org/pandas-docs/stable/reference/series.html

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

## 2.1 Import pandas library

In [None]:
import pandas as pd

Pandas is well suited for many different kinds of data:
* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Data structures in pandas are:

* `Series` objects: 1D array, similar to a column in a spreadsheet
* `DataFrame` objects: 2D table, similar to a spreadsheet
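* `Panel` objects: a dictionary-like container of DataFrames, similar to a set of sheets in an MS Excel workbook. Note that `Panel` has been deprecated and removed from recent pandas releases; a `DataFrame` with a `MultiIndex` is the usual replacement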

## 2.2 Series

A pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A Series can be thought of as a single column in an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.

![series](https://media.geeksforgeeks.org/wp-content/uploads/20200225170506/pandas-series.png)
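The two indexing styles described above can be sketched as follows. The data and labels here are made up for illustration, and the label `'a'` is repeated on purpose to show that labels need not be unique.

```python
import pandas as pd

# Hypothetical example data: the label 'a' appears twice on purpose.
s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

by_label = s.loc['b']      # label-based lookup
by_position = s.iloc[0]    # position-based lookup (first element)
duplicates = s.loc['a']    # a repeated label returns a Series with every match

print(by_label, by_position)
print(duplicates)
```

Using `.loc` for labels and `.iloc` for positions keeps the intent explicit, which helps when the labels themselves are integers.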

### 2.2.1 Create a Series

In the real world, a pandas Series is usually created by loading data from existing storage, such as an SQL database, a CSV file, or an Excel file. A Series can also be created from a list, a dictionary, a scalar value, etc.

We can easily create a Pandas `Series` from a NumPy array:

In [None]:
import pandas as pd
import numpy as np

ndarray = np.array(['a','b','c','d'])
serie = pd.Series(ndarray)
print(serie)

Or simply by passing a Python list:

In [None]:
data = pd.Series([1, 2, 3, 4, 5])
print(data)

The indices can be integers like in the example above. Alternatively, the indices can be labels.

In [None]:
data_1 = [1,2,3]
index_1 = ['a', 'b', 'c']
data = pd.Series(data_1, index=index_1)
data

Also, `Series` can be easily created from dictionaries

In [None]:
dictionary = {'A':5, 'B':3, 'C':1}
data = pd.Series(dictionary)
data
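A scalar value, mentioned earlier as another possible source, might be used as in this minimal sketch. The value and the labels are arbitrary; an index is supplied so that pandas knows how many times to repeat the scalar.

```python
import pandas as pd

# The scalar 7 is repeated once for every label in the index.
constant = pd.Series(7, index=['x', 'y', 'z'])
print(constant)
```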

## 2.3 DataFrame

If a `Series` is a one-dimensional indexed array, the `DataFrame` is a two-dimensional indexed array. It can be thought of as a collection of Series objects, where each Series represents a column, or as an enhanced 2D numpy array. 

In a `DataFrame`, the `index` holds the labels for each row, while `columns` holds the labels for each column.

![dataframe](https://media.geeksforgeeks.org/wp-content/uploads/20200225170602/pandas-dataframe.png)
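As a small illustration of the row and column labels, the sketch below builds a tiny `DataFrame` with made-up values and labels, then inspects both axes and pulls out one column as a `Series`.

```python
import pandas as pd

# Hypothetical data: two rows labeled 'r1' and 'r2', two columns 'A' and 'B'.
df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.0]}, index=['r1', 'r2'])

print(df.index)      # row labels
print(df.columns)    # column labels

col = df['A']        # selecting a single column returns a Series
print(type(col))
```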

### 2.3.1 Create a data frame

Just like a Series, in the real world a pandas DataFrame is usually created by loading data from existing storage, such as an SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc.

First, let's create a `DataFrame` using random numbers generated from numpy.

In [None]:
import numpy as np
numpy_array = np.random.random((5, 3))
data = pd.DataFrame(numpy_array)
data

We could also initialize with column names

In [None]:
data = pd.DataFrame(np.random.random((5, 3)), columns=['A', 'B', 'C'])
data

Alternatively, we could create a `DataFrame` from multiple `Series` objects.

In [None]:
a = pd.Series([1, 2, 3, 4])
#a = [1,2,3,4]
b = pd.Series(['a', 'b', 'c', 'd'])
#b = ['a', 'b', 'c', 'd']
data = pd.DataFrame({'Numbers':a, 'Letters':b})
#data.index = ['A', 'B', 'C', 'D']
data

Or from a Python dictionary of lists:

In [None]:
data = pd.DataFrame({'Region':['AML', 'North', 'Center', 'Alentejo', 'Algarve', 'Azores', 'Madeira'], 
                     'Cities':['Lisbon', 'Porto', 'Coimbra', 'Beja', 'Faro', 'Ponta Delgada', 'Funchal']}, 
                    index=['AML', 'NOR', 'CEN', 'ALE', 'ALG', 'AZO', 'MAD'])
data
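A list of dictionaries, mentioned earlier as another possible source, could be handled along these lines. The records are invented for illustration; each dictionary becomes one row, and a key missing from a record shows up as `NaN`.

```python
import pandas as pd

# Hypothetical records: the second one has an extra 'Breed' key.
records = [
    {'Name': 'Rex', 'Age': 3},
    {'Name': 'Luna', 'Age': 5, 'Breed': 'Beagle'},
]

from_records = pd.DataFrame(records)
print(from_records)
```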

### 2.3.2 DataFrame - More examples

In [None]:
doggos = [
        ['Peter Parker','Doberman',3],
        ['John Reese','Golden Retriever',8],
        ['Bill Gates','Poodle',6],
        ['Mary Poppins','Beagle',1],
        ['Phoebe Buffay','Chihuahua',4],
        ['Sheldon Cooper','Siberian Husky',10],
        ['Patrick Jane','Siberian Husky',14]
]

In [None]:
doggos_df=pd.DataFrame(doggos,columns=['Name','Breed','Age'])
doggos_df

### 2.3.3 Previewing the data frame

[**DataFrame.head(n=5)**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head)
* This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it

In [None]:
doggos_df.head()

[**DataFrame.tail(n=5)**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail)
* This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows

In [None]:
doggos_df.tail(3)

[**DataFrame.shape**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html#pandas.DataFrame.shape)
* Return a tuple representing the dimensionality of the DataFrame.

In [None]:
doggos_df.shape

In [None]:
doggos_df.shape[0]

[**DataFrame.columns**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html#pandas.DataFrame.columns)
* The column labels of the DataFrame

In [None]:
doggos_df.columns

[**DataFrame.dtypes**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes)
* Return the dtypes in the DataFrame.
* This returns a Series with the data type of each column.
* The result’s index is the original DataFrame’s columns.
* Columns with mixed types are stored with the object dtype.

In [None]:
doggos_df.dtypes
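The point about mixed-type columns can be checked with a small made-up frame. In this sketch, one column holds only integers and the other mixes an integer, a string, and a float.

```python
import pandas as pd

# Hypothetical data: 'mixed' deliberately combines int, str and float values.
mixed = pd.DataFrame({'ints': [1, 2, 3], 'mixed': [1, 'two', 3.0]})

# 'ints' gets an integer dtype, while 'mixed' falls back to object.
print(mixed.dtypes)
```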

[**DataFrame.values**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values)
* Return a Numpy representation of the DataFrame.
* The pandas documentation recommends using `DataFrame.to_numpy()` instead.
* Only the values in the DataFrame are returned; the axis labels are removed.

In [None]:
doggos_df.values
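For comparison, here is a minimal sketch of the recommended `to_numpy()` call on a small made-up frame. The result is a plain NumPy array without row or column labels.

```python
import pandas as pd

# Hypothetical data with two integer columns.
small = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

arr = small.to_numpy()   # 2D NumPy array, one row per DataFrame row
print(arr)
print(arr.shape)
```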

### 2.3.4 Some Statistics

The DataFrame member function [`describe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) provides useful summary statistics such as the total number of samples, mean, standard deviations, min/max, and quartiles for each column of the DataFrame. 

[**DataFrame.describe(percentiles=None, include=None, exclude=None)**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe)
* Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
* Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided.

In [None]:
some_stats = doggos_df.describe()
some_stats
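By default, `describe` on a frame with mixed column types summarizes only the numeric columns. The sketch below uses a small invented frame to show how `include='all'` might be used to summarize the text column as well.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
pets = pd.DataFrame({'Breed': ['Poodle', 'Beagle', 'Poodle'],
                     'Age': [6, 1, 4]})

numeric_only = pets.describe()               # numeric columns only
everything = pets.describe(include='all')    # adds unique/top/freq for text columns

print(numeric_only)
print(everything)
```

Statistics that do not apply to a column (for example, `mean` for `Breed`) appear as `NaN` in the combined table.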

[**Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)
* Return a Series containing counts of unique values.
* The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

In [None]:
doggos_df['Breed'].value_counts()
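The `normalize` parameter from the signature above returns relative frequencies instead of raw counts. A short sketch with made-up values:

```python
import pandas as pd

# Hypothetical data: 'Husky' appears in two of the four entries.
breeds = pd.Series(['Husky', 'Poodle', 'Husky', 'Beagle'])

shares = breeds.value_counts(normalize=True)   # fractions that sum to 1
print(shares)
```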

## 2.4 Loading Data
Pandas offers extensive utilities for reading and parsing data from text files. The function `read_csv` has numerous options for managing header/footer lines, parsing dates, selecting specific columns, etc. in comma-separated value (CSV) files. The default index for the DataFrame is a sequence of monotonically increasing integers starting at 0, unless otherwise specified with the keyword argument `index_col`.

There are similar functions for reading Microsoft Excel spreadsheets (read_excel) and fixed width formatted text (read_fwf).
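The effect of `index_col` can be sketched without any file on disk by reading from an in-memory string. The CSV text, the `;` separator, and the column names below are assumptions made purely for this illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content using ';' as the separator.
csv_text = "id;name;score\n10;Ana;7.5\n20;Rui;9.0\n"

# Without index_col, rows are labeled 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text), sep=';')

# With index_col, the 'id' column becomes the row index.
labeled = pd.read_csv(io.StringIO(csv_text), sep=';', index_col='id')

print(default_index)
print(labeled)
```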

Loading data is fairly straightforward in pandas. Type `pd.read` and press Tab to see a list of functions that can load specific file formats, such as CSV, Excel, SPSS, and SQL.

In this example, we will use `pd.read_csv` to load a .csv file into a dataframe. 
Note that `read_csv()` has many options that can be used to make sure you load the data correctly. You can explore the docstring of a function to get more information about its inputs and general usage guidelines by running `pd.read_csv?`.

In [None]:
pd.read_csv?

To load a csv file we will need to specify either the relative or absolute path to the file.

In a Jupyter notebook, the command `pwd` prints the path of the current working directory.

In [None]:
pwd
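Outside a notebook, where `pwd` is not available as a command, the same information can be obtained from plain Python. This is a minimal sketch using the standard library; the file name `cars.csv` is only joined onto the path here, and nothing is read from disk.

```python
from pathlib import Path

cwd = Path.cwd()                # current working directory as an absolute path
csv_path = cwd / 'cars.csv'     # absolute path a relative 'cars.csv' would resolve to

print(cwd)
print(csv_path)
```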

Pandas has many ways to read different data formats into a DataFrame. Here we will use the `pd.read_csv` function.

In [None]:
cars = pd.read_csv('cars.csv', sep=';')
cars

### 2.4.1 Exercises

All the exercises below are based on the dataset `cars.csv` from Class 4

a) Present the content of the first 5 lines:

In [None]:
cars.head()

b) Present the content of the last 10 lines:

In [None]:
cars.tail(10)

c) Which makes / models of cars have the most occurrences?

In [None]:
cars['Car'].value_counts()

d) How many cars are of European origin?

In [None]:
unique_origins = cars['Origin'].value_counts()
unique_origins['Europe']

d1) Select only the cars of European origin

In [None]:
filter_europe = cars['Origin'] == 'Europe'
#filter_europe
cars[filter_europe]
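Boolean filters like the one above can be negated with `~` and combined with `&`. The sketch below uses a small invented frame rather than `cars.csv`; the `Cylinders` column and all values are hypothetical and chosen only to demonstrate the operators.

```python
import pandas as pd

# Hypothetical stand-in for the cars dataset.
toy = pd.DataFrame({
    'Car': ['A', 'B', 'C', 'D'],
    'Origin': ['Europe', 'US', 'Japan', 'Europe'],
    'Cylinders': [4, 8, 4, 6],
})

is_europe = toy['Origin'] == 'Europe'

non_europe = toy[~is_europe]                            # rows where the mask is False
europe_four = toy[is_europe & (toy['Cylinders'] == 4)]  # both conditions must hold

print(non_europe)
print(europe_four)
```

Each condition is wrapped in parentheses because `&` binds more tightly than `==` in Python.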

e) Show some statistics about the dataset.

In [None]:
cars.describe()