# 1. Introduction to Pandas


Pandas is an open-source library and one of the most popular Python libraries for data analysis and data science. Pandas allows:


1.   exploring, cleaning, importing, and processing data from various sources such as comma-separated values (CSV) files, JSON, SQL database tables or queries, and Microsoft Excel.
2.   various data manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling.






There are two ways to install pandas:


*   As part of the Anaconda distribution, using conda: **conda install pandas**
*   From PyPI, using the pip command:



In [None]:
pip install pandas



Before getting started with pandas, we need to import it:



In [None]:
import pandas as pd


To check the version of Pandas you are running, do:


In [None]:
pd.__version__

'1.1.5'

To get started with pandas, we need to be comfortable with its data structures. Pandas is well suited for many different kinds of data, such as:

*   Tabular data with heterogeneously typed columns
*   Ordered and unordered time series data
*   Arbitrary matrix data with row and column labels
*   Any other form of observational or statistical data set

The data need not be labeled at all to be placed into a pandas data structure. There are two primary data structures: **Series (1-dimensional)** and **DataFrame (2-dimensional)**.

Most objects returned by pandas operations are either a **Series** or a **DataFrame**.




**DataFrames** and **Series** are not simply storage containers. Pandas treats them similarly, and both have built-in support for a variety of data-wrangling operations, such as:

* Single-level and hierarchical indexing
* Handling missing data
* Arithmetic and Boolean operations on entire columns and tables
* Database-type operations (such as merging and aggregation)
* Plotting individual columns and whole tables
* Reading data from files and writing data to files
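
As a small illustration of the database-type operations listed above, here is a minimal sketch of a merge followed by an aggregation. The student, course, and teacher values are hypothetical and made up only for this example.

```python
import pandas as pd

# Hypothetical sample data, made up for this illustration.
students = pd.DataFrame({
    "student": ["Ada", "Ben", "Cleo", "Dan"],
    "course": ["Python", "Python", "Statistics", "Statistics"],
    "mark": [17, 12, 9, 15],
})
teachers = pd.DataFrame({
    "course": ["Python", "Statistics"],
    "teacher": ["Ms. Kone", "Mr. Diallo"],
})

# Database-style join on the shared "course" column.
merged = students.merge(teachers, on="course", how="left")

# Aggregation: average mark per course.
mean_by_course = merged.groupby("course")["mark"].mean()

print(merged)
print(mean_by_course)
```

The merge returns a DataFrame and the aggregation returns a Series, the two structures introduced above.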

# 2. Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.


You can create a simple Series from any sequence (**a list, a tuple, or a NumPy array**) or even from **a Python dictionary**.

In [None]:
# From a list
l = [1,3,19,37]
s1 = pd.Series(l)
s1

0     1
1     3
2    19
3    37
dtype: int64

In [None]:
# From a list of strings
ls = ["I", "enjoy", "working", "with", "pandas"]
s2 = pd.Series(ls)
s2

0          I
1      enjoy
2    working
3       with
4     pandas
dtype: object

In [None]:
# From a list of mixed objects
lo = ["Pizza", 'number', 1, 'is ', 15.5, "$"]
s3 = pd.Series(lo)
s3

0     Pizza
1    number
2         1
3       is 
4      15.5
5         $
dtype: object

In [None]:
# From a tuple
t = (0.5, 3, 18)
s4 = pd.Series(t)
s4

0     0.5
1     3.0
2    18.0
dtype: float64

In [None]:
# From a NumPy array
# Let's import the NumPy library first
import numpy as np

a1 = np.arange(5)
s5 = pd.Series(a1)
s5

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [None]:
# From a python dictionary
d1={
    'name': "Pandas",
    "version": 1.1,
    'library': "python"
}
print("dictionary:", d1)
s6 = pd.Series(d1)
print("\nSeries:\n",s6)

dictionary: {'name': 'Pandas', 'version': 1.1, 'library': 'python'}

Series:
 name       Pandas
version       1.1
library    python
dtype: object
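
The `dtype` line at the bottom of each output above is inferred from the values passed in. The following sketch, which reuses values from the cells above, shows the inference and how a dtype can be requested explicitly; the variable names are only illustrative.

```python
import pandas as pd

ints = pd.Series([1, 3, 19, 37])               # all integers
mixed_numbers = pd.Series([0.5, 3, 18])        # integers are upcast to float
mixed_objects = pd.Series(["Pizza", 1, 15.5])  # falls back to generic Python objects

print(ints.dtype, mixed_numbers.dtype, mixed_objects.dtype)

# The dtype can also be requested explicitly.
as_float = pd.Series([1, 3, 19, 37], dtype="float64")
print(as_float.dtype)
```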


When a Series is printed, a column always appears on the left.

This column is the **index** of the Series. By default, it runs from 0 to n-1, where n is the length of the Series.

When a Series is built from a dictionary, the **keys** of the dictionary become the index, and the **values** of the dictionary become the content of the Series.

Let's run the cell below:


In [None]:
print("index:",s6.index)
print("\nvalues:",s6.values)

index: Index(['name', 'version', 'library'], dtype='object')

values: ['Pandas' 1.1 'python']


We can also create a **Series** along with its index labels:

In [None]:
# Information recorded about a student
s7 = pd.Series( ["Naomy" , "Cameroon" , "Single" , "French"],
              index = ["Name" , "Nationality" , "Status" , "Language"])
s7

Name              Naomy
Nationality    Cameroon
Status           Single
Language         French
dtype: object

In [None]:
# A student's marks (out of 20) in the following courses,
# recorded during a survey
s8 =  pd.Series([12 , 17 , 15 , 13 , 9],
               index = ["Mathematic" , "Python" , "Computer science", "Data science" , "Statistic"])
s8

Mathematic          12
Python              17
Computer science    15
Data science        13
Statistic            9
dtype: int64

We can access each element by its label:

In [None]:
print("Mathematic:", s8["Mathematic"])
print("Python:", s8["Python"])

Mathematic: 12
Python: 17


We can check which courses have a mark **below 15**:

In [None]:
s8 < 15

Mathematic           True
Python              False
Computer science    False
Data science         True
Statistic            True
dtype: bool

It returns Boolean values: True where there is a match and False otherwise. 

To get the actual values, run the cell below:

In [None]:
s8[s8<15]

Mathematic      12
Data science    13
Statistic        9
dtype: int64
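
To get the number of courses below 15 rather than the list of them, the Boolean Series can be summed. This is a minimal sketch that rebuilds the marks from the cell above under the illustrative name `marks`.

```python
import pandas as pd

marks = pd.Series(
    [12, 17, 15, 13, 9],
    index=["Mathematic", "Python", "Computer science", "Data science", "Statistic"],
)

below_15 = marks < 15          # Boolean Series, one entry per course
n_below = int(below_15.sum())  # True counts as 1, False as 0
print("Courses below 15:", n_below)

# Conditions can be combined with & (and) and | (or); the parentheses are required.
between = marks[(marks >= 12) & (marks < 15)]
print(between)
```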

We can apply statistical operations to a Series:

In [None]:
# sum of marks
print('Total:',s8.sum())

# Lowest mark
print('min:',s8.min())

# Average mark
print('mean:', s8.mean())

# Highest mark
print('max:', s8.max())

Total: 66
min: 9
mean: 13.2
max: 17
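
Several of these statistics can be obtained in one call with `describe()`, and `idxmax()` / `idxmin()` return the labels of the extreme values rather than the values themselves. The sketch below rebuilds the marks under the illustrative name `marks`.

```python
import pandas as pd

marks = pd.Series(
    [12, 17, 15, 13, 9],
    index=["Mathematic", "Python", "Computer science", "Data science", "Statistic"],
)

summary = marks.describe()  # count, mean, std, min, quartiles, max
print(summary)

print("Best course:", marks.idxmax())     # label of the highest mark
print("Weakest course:", marks.idxmin())  # label of the lowest mark
```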


# 3. DataFrame

A DataFrame is a two-dimensional data structure with labels for both its rows and its columns. Like Series, DataFrame accepts many different types of input, such as a dict of 1-D ndarrays, lists, dicts, or Series; a 2-D numpy.ndarray; or another DataFrame. It is similar to a SQL table or a spreadsheet in Excel or Calc, and it is widely used in data science, machine learning, scientific computing, and many other data-intensive fields.




A DataFrame can be viewed as a collection of Series. 
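
To make the "collection of Series" view concrete, here is a minimal sketch that builds a DataFrame from two Series and then pulls one back out. The variable names are illustrative only.

```python
import pandas as pd

city = pd.Series(["Paris", "Yaounde", "Kigali"], name="City")
country = pd.Series(["France", "Cameroon", "Rwanda"], name="Country")

# Each Series becomes one column; rows are matched by index label.
df = pd.DataFrame({"City": city, "Country": country})
print(df)

# Selecting a column gives back a Series that shares the DataFrame's index.
print(type(df["City"]).__name__)
print(df["City"].index.equals(df.index))
```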

There are many ways to construct a DataFrame. One of the most common is from a dict of equal-length lists or NumPy arrays.



In [None]:
# List of cities and their countries
d1 = {'City': ['Paris', 'Yaounde', 'Kigali', 'Accra'], 
       'Country': ['France', 'Cameroon', 'Rwanda', 'Ghana']}
df1 = pd.DataFrame(d1)
df1       

      City   Country
0    Paris    France
1  Yaounde  Cameroon
2   Kigali    Rwanda
3    Accra     Ghana


As we can see from the above result, the DataFrame is like a table with rows and columns.

To query the content of a DataFrame, we can select its individual columns, each of which is a Series:

In [None]:
df1['City']

0      Paris
1    Yaounde
2     Kigali
3      Accra
Name: City, dtype: object

In [None]:
df1['Country']

0      France
1    Cameroon
2      Rwanda
3       Ghana
Name: Country, dtype: object

We might decide to add another column, for instance the language spoken in each country:

In [None]:
df1['Language'] = ['French', 'French', 'English', 'English']

From a DataFrame, we can get the subset of rows satisfying certain conditions.

Q: Which countries speak English?

In [None]:
df1[df1['Language']=='English']

     City  Country  Language
2  Kigali   Rwanda   English
3   Accra    Ghana   English
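
The filter above returns whole rows. To answer the question with just the country names, the Boolean condition can be combined with a column label in `loc`. The sketch below rebuilds the same table under the illustrative name `df` and also shows `isin()` for membership tests.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Paris", "Yaounde", "Kigali", "Accra"],
    "Country": ["France", "Cameroon", "Rwanda", "Ghana"],
    "Language": ["French", "French", "English", "English"],
})

# Keep only the Country column of the matching rows.
english_countries = df.loc[df["Language"] == "English", "Country"]
print(english_countries.tolist())

# isin() tests membership in a list of values.
subset = df[df["Country"].isin(["France", "Ghana"])]
print(subset)
```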


## Locate row

Pandas uses the **loc** attribute to return one or more specified rows.




In [None]:
# return the first row of the dataframe df1
df1.loc[0]

City         Paris
Country     France
Language    French
Name: 0, dtype: object

The above result is a **Pandas series**.


We can pass a list of index labels to return more than one row. The result is then a DataFrame:


In [None]:
# Use a list of the index labels of the specific rows
# Return rows 0 and 1
df1.loc[[0,1]]

      City   Country  Language
0    Paris    France    French
1  Yaounde  Cameroon    French
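
Note that `loc` selects by index label, not by position; the two only coincide here because the default index runs from 0. The sketch below uses a deliberately non-default index to show the difference, and introduces `iloc` (position-based selection), which is not covered elsewhere in this section.

```python
import pandas as pd

df = pd.DataFrame(
    {"City": ["Paris", "Yaounde", "Kigali"],
     "Country": ["France", "Cameroon", "Rwanda"]},
    index=[10, 20, 30],  # labels chosen to differ from positions
)

by_label = df.loc[20]        # row whose index label is 20
by_position = df.iloc[1]     # second row, whatever its label
cell = df.loc[30, "City"]    # single value: row label 30, column "City"
label_slice = df.loc[10:20]  # label slices include both endpoints

print(by_label)
print(cell)
print(label_slice)
```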


### Rename the index

With the **index** argument, we can provide our own index labels:

In [None]:
# Add a list of row names as the index
# The temperature of some students recorded during the week
data = {'Name':['Pauline', 'Sandra', 'Domini', 'Henri', "Sabrina"],
        'Temperature (°C)': [29,32,27,19,35]}

df2 = pd.DataFrame(data, index = ["Monday", 'Tuesday', "Wednesday", 'Thursday', 'Friday'])
df2

              Name  Temperature (°C)
Monday     Pauline                29
Tuesday     Sandra                32
Wednesday   Domini                27
Thursday     Henri                19
Friday     Sabrina                35


In [None]:
#  Use the named index in the loc attribute to return the specified row(s). 
df2.loc["Monday"]

Name                Pauline
Temperature (°C)         29
Name: Monday, dtype: object
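
The index can also be changed after the DataFrame has been created. The following sketch shows three common options on a shortened copy of the table above; the name `df` and the abbreviated day labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Pauline", "Sandra", "Domini"],
    "Temperature (°C)": [29, 32, 27],
})

# Assign a whole new index after construction.
df.index = ["Monday", "Tuesday", "Wednesday"]

# rename() changes selected labels and returns a new DataFrame.
renamed = df.rename(index={"Monday": "Mon", "Tuesday": "Tue"})

# set_index() promotes an existing column to be the index.
by_name = df.set_index("Name")

print(renamed)
print(by_name.loc["Sandra", "Temperature (°C)"])
```

Both `rename()` and `set_index()` leave `df` itself unchanged unless the result is assigned back.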

## Load Files Into a DataFrame

We will use the Iris dataset for the example below.

The Iris dataset contains 3 types of Iris plants (setosa, virginica, versicolor), and each sample has four attributes (the length and width of the sepal and petal).


In [None]:
# We first mount the drive to access the dataset file
# Load a comma-separated values (CSV) file into a DataFrame:

df3 = pd.read_csv('/content/drive/MyDrive/iris.csv')

# print the first 5 rows of the dataframe df3
df3.head()

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
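
The path above only exists on a mounted Google Drive, so here is a self-contained sketch that reads the same kind of data from an in-memory string. It assumes the file has a space after each comma, as the column names printed later in this section (`' sepal_width'`, and so on) suggest; the sample rows are copied from the output above.

```python
import io
import pandas as pd

# A few rows in the same layout as the iris file, with a space after each comma.
csv_text = """sepal_length, sepal_width, petal_length, petal_width, class
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
"""

# skipinitialspace=True drops the blank that follows each delimiter,
# so the column names come out as 'sepal_width' rather than ' sepal_width'.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(df.columns.tolist())
print(df.head(2))
```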


In [None]:
# Information about the DataFrame
# info() reports the number of columns (features), the number of non-null
# values in each column, and each column's dtype, which reveals missing data
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sepal_length   150 non-null    float64
 1    sepal_width   150 non-null    float64
 2    petal_length  150 non-null    float64
 3    petal_width   150 non-null    float64
 4    class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB



The shape attribute of pandas.DataFrame stores the number of rows and columns as a tuple (number of rows, number of columns).

In [None]:
# The shape of the dataframe
df3.shape

(150, 5)

### Iterate over columns

To iterate over columns, we can create a list of the DataFrame's column names and then loop through that list to pull out each column.


In [None]:
# Create a list of the DataFrame's column names
columns = list(df3)
print("columns:", columns)
for col in columns:
    # Print the third element of each column (the row labelled 2)
    print(df3[col][2])



columns: ['sepal_length', ' sepal_width', ' petal_length', ' petal_width', ' class']
4.7
3.2
1.3
0.2
Iris-setosa
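
An alternative that avoids building the list first is `DataFrame.items()`, which yields each column name together with its Series. The sketch below uses a small stand-in table (values taken from the iris rows above) rather than the real file.

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "class": ["Iris-setosa"] * 3,
})

third_values = {}
for name, column in df.items():          # yields (column name, Series) pairs
    third_values[name] = column.iloc[2]  # third element, by position

print(third_values)
```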


### Missing data

Missing data occurs when no information is provided for one or more items, or for a whole unit. It is a very common problem in real-life scenarios. In pandas, missing values are also referred to as NA (Not Available) values.

To check for missing values in a pandas DataFrame, we use the functions **isnull()** and **notnull()**. Both check whether a value is NaN or not. They can also be used on a pandas Series to find null values.



According to info() above, the DataFrame df3 has no missing values.

Let us create a new DataFrame with missing values:

In [None]:
# np.nan is the missing value

iris = {'sepal_length':[100, 90, np.nan, 95,23],
        'sepal_width': [30, 45, 56, np.nan, 6],
        'petal_length':[np.nan, 40, 80, 98, 9],
        'petal_width':[np.nan, 17, 5, 17,32]}
 
# creating a dataframe from a dictionary of lists
df4 = pd.DataFrame(iris)
 
# using isnull() function  
df4.isnull()



   sepal_length  sepal_width  petal_length  petal_width
0         False        False          True         True
1         False        False         False        False
2          True        False         False        False
3         False         True         False        False
4         False        False         False        False


The result above is a DataFrame of Boolean values: True for missing values and False for non-null values.
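
Because True counts as 1, the Boolean table can be summed to count missing values. This is a minimal sketch that rebuilds the same data under the illustrative name `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [100, 90, np.nan, 95, 23],
    "sepal_width": [30, 45, 56, np.nan, 6],
    "petal_length": [np.nan, 40, 80, 98, 9],
    "petal_width": [np.nan, 17, 5, 17, 32],
})

missing_per_column = df.isnull().sum()       # number of NaN in each column
rows_with_missing = df.isnull().any(axis=1)  # True if any cell in the row is NaN
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(rows_with_missing.tolist())
print("Total missing:", total_missing)
```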

### Filling missing values

Missing values can be filled using **fillna()**, **replace()**, and **interpolate()**.

fillna() and replace() substitute NaN with a value that we specify. interpolate() also fills NA values, but it uses an interpolation technique to estimate them rather than a hard-coded value.


In [None]:
# filling missing value using fillna()  
# Filling nan with the value 0
df4.fillna(0)

   sepal_length  sepal_width  petal_length  petal_width
0         100.0         30.0           0.0          0.0
1          90.0         45.0          40.0         17.0
2           0.0         56.0          80.0          5.0
3          95.0          0.0          98.0         17.0
4          23.0          6.0           9.0         32.0


NaN can be filled with other values, depending on our goal.
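
As a sketch of those other options, the example below fills a small hypothetical Series three ways: with the mean of the known values, by linear interpolation, and with `replace()`. The numbers are made up so that the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the results are easy to verify.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

filled_mean = s.fillna(s.mean())    # mean() skips NaN: (10 + 30 + 50) / 3 = 30
interpolated = s.interpolate()      # linear by default: 10, 20, 30, 40, 50
replaced = s.replace(np.nan, -1.0)  # replace() swaps one value for another

print(filled_mean.tolist())
print(interpolated.tolist())
print(replaced.tolist())
```

Which strategy is appropriate depends on the data: interpolation only makes sense when the order of the rows is meaningful, for example in a time series.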

We can also decide to drop all the rows with at least one NaN (null) value using the **dropna()** function.

In [None]:
# using dropna() function  
df4.dropna()

   sepal_length  sepal_width  petal_length  petal_width
1          90.0         45.0          40.0         17.0
4          23.0          6.0           9.0         32.0
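
Dropping every row that has any NaN can discard a lot of data. `dropna()` accepts a few parameters to make it less strict; the sketch below rebuilds the same data under the illustrative name `df` and shows `subset` and `thresh`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [100, 90, np.nan, 95, 23],
    "sepal_width": [30, 45, 56, np.nan, 6],
    "petal_length": [np.nan, 40, 80, 98, 9],
    "petal_width": [np.nan, 17, 5, 17, 32],
})

any_missing_dropped = df.dropna()                  # default: drop rows with any NaN
subset_dropped = df.dropna(subset=["sepal_length"])  # only consider one column
at_least_3 = df.dropna(thresh=3)                   # keep rows with >= 3 non-null values

print(any_missing_dropped.index.tolist())
print(subset_dropped.index.tolist())
print(at_least_3.index.tolist())
```

None of these calls modifies `df`; each returns a new DataFrame.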
