# What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

## Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

## What Can Pandas Do?
Pandas gives you answers about the data, such as:

Is there a correlation between two or more columns?

What is the average value?

What is the maximum value?

What is the minimum value?

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
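
The questions above can be sketched in a few lines once Pandas is installed. The data below is made up for illustration (the `workouts` name, the column names, and the values are assumptions, not part of this tutorial's data set). The sketch relies on the `dropna()`, `corr()`, `mean()`, `max()`, and `min()` methods.

```python
import pandas as pd

# Hypothetical data for illustration only; None marks an empty value.
workouts = pd.DataFrame({
    "duration": [60, 45, 30, None],
    "calories": [400, 300, 200, 250],
})

# Cleaning: drop every row that contains an empty (NaN) value.
cleaned = workouts.dropna()

# Correlation between every pair of numeric columns.
print(cleaned.corr())

# Average, maximum, and minimum of one column.
print(cleaned["calories"].mean())
print(cleaned["calories"].max())
print(cleaned["calories"].min())
```

Dropping rows is only one possible cleaning strategy. Whether it is appropriate depends on how much data would be lost.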

### Installation of Pandas
If you have Python and pip already installed on a system, then installation of Pandas is very easy.

Install it using this command:

C:\Users\Your Name>pip install pandas

### Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:

In [1]:
import pandas

In [2]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Pandas as pd
Pandas is usually imported under the pd alias.

In [3]:
import pandas as pd

In [4]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


## What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [5]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


### Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [6]:
print(myvar[0])

1


### Create Labels
With the index argument, you can name your own labels.

In [7]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


When you have created labels, you can access an item by referring to the label.

In [8]:
print(myvar["y"])

7
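
As a small complement to label access, the sketch below contrasts lookup by label with lookup by position, using the `loc` and `iloc` attributes of a Series. The variable names (`labeled`, `by_label`, `by_position`) are chosen here for illustration only.

```python
import pandas as pd

# The same values as in the example above, with custom labels.
labeled = pd.Series([1, 7, 2], index=["x", "y", "z"])

# .loc looks a value up by its label.
by_label = labeled.loc["y"]

# .iloc looks a value up by its integer position (0-based).
by_position = labeled.iloc[1]

print(by_label, by_position)
```

Being explicit with `loc` and `iloc` avoids ambiguity about whether a key is meant as a label or as a position.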


## What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

In [9]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 

   calories  duration
0       420        50
1       380        40
2       390        45


### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [10]:
#refer to the row index:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


In [11]:
#use a list of indexes:
print(df.loc[[0, 1]])

   calories  duration
0       420        50
1       380        40
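
The two calls above return different kinds of objects, which the following sketch makes visible. It rebuilds the same data under a hypothetical name (`df_rows`) so that it runs on its own.

```python
import pandas as pd

df_rows = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

single = df_rows.loc[0]        # one label: the row comes back as a Series
several = df_rows.loc[[0, 1]]  # list of labels: the rows come back as a DataFrame

print(type(single).__name__)
print(type(several).__name__)
```

Passing a one-element list, such as `df_rows.loc[[0]]`, is a common way to keep the result a DataFrame even for a single row.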


### Named Indexes
With the index argument, you can name your own indexes.

In [12]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 

      calories  duration
day1       420        50
day2       380        40
day3       390        45


### Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).

In [13]:
#refer to the named index:
print(df.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
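
A short sketch of a few related lookups on a DataFrame with named indexes. The variable names (`days`, `subset`, `second_row`, `value`) are illustrative assumptions, and the data repeats the example above.

```python
import pandas as pd

days = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# A list of labels returns the matching rows as a DataFrame.
subset = days.loc[["day1", "day3"]]

# iloc selects by position, regardless of the labels.
second_row = days.iloc[1]

# A row label and a column label together return a single value.
value = days.loc["day2", "calories"]

print(subset)
print(second_row)
print(value)
```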


## Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values files).

CSV files contain plain text and are a well-known format that can be read by most tools, including Pandas.

In our examples we will be using a CSV file called 'oedema.csv'. The to_string() method is used to print the entire DataFrame rather than a shortened view.

In [15]:
df = pd.read_csv('oedema.csv')
print(df.to_string()) 

      sex    age  weight  height  muac age_group oedema
0       1  18.83   10.30    78.2   148  12-23 mo      n
1       1  35.02   12.60    90.5   166  24-35 mo      n
2       1  48.00   16.40   101.6   164  48-59 mo      n
3       1  34.00   13.50    91.0   161  24-35 mo      n
4       1  18.79   12.40    83.8   160  12-23 mo      n
5       2  24.00    8.10    76.8   122  24-35 mo      n
6       1   6.00   12.50    65.0   156  06-11 mo      y
7       2  11.00    8.10    71.5   151  06-11 mo      n
8       2  42.00   10.60    89.5   149  36-47 mo      y
9       1  45.00   11.80    87.1   141  36-47 mo      n
10      1  23.00   12.80    78.1   160  12-23 mo      y
11      2  30.00   11.30    81.7   153  24-35 mo      n
12      1  18.00   12.10    81.5   152  12-23 mo      y
13      1   7.03    8.10    70.9   142  06-11 mo      n
14      1   4.99    6.80    67.1   135  00-05 mo      y
15      2  30.00   11.80    89.3   142  24-35 mo      n
16      2  36.00   12.50    93.0   166  36-47 mo

In [16]:
print(df)

      sex    age  weight  height  muac age_group oedema
0       1  18.83    10.3    78.2   148  12-23 mo      n
1       1  35.02    12.6    90.5   166  24-35 mo      n
2       1  48.00    16.4   101.6   164  48-59 mo      n
3       1  34.00    13.5    91.0   161  24-35 mo      n
4       1  18.79    12.4    83.8   160  12-23 mo      n
...   ...    ...     ...     ...   ...       ...    ...
1175    2  34.00    12.1    79.9   142  24-35 mo      n
1176    1  37.00    13.9    94.5   145  36-47 mo      n
1177    1  22.00    14.0    78.0   152  12-23 mo      n
1178    2  12.00     8.0    74.1   133  12-23 mo      n
1179    1  18.00    10.1    75.0   153  12-23 mo      n

[1180 rows x 7 columns]
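
To try `read_csv()` without a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV content is a hypothetical stand-in with three of the columns shown above. `io.StringIO` is used here only so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file such as 'oedema.csv'.
csv_text = """sex,age,weight
1,18.83,10.3
2,24.00,8.1
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```

The first line of the CSV text becomes the column names, and Pandas infers a data type for each column from its values.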


### max_rows
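The number of rows returned is defined in the Pandas option settings. When a DataFrame has more rows than this limit, print(df) shows only the first and last rows, as in the output above.

You can check your system's maximum rows with the pd.options.display.max_rows statement, and change the limit by assigning a new value to the same option.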

In [17]:
print(pd.options.display.max_rows) 

60


In [19]:
pd.options.display.max_rows = 9999

df = pd.read_csv('oedema.csv')

print(df)

      sex    age  weight  height  muac age_group oedema
0       1  18.83   10.30    78.2   148  12-23 mo      n
1       1  35.02   12.60    90.5   166  24-35 mo      n
2       1  48.00   16.40   101.6   164  48-59 mo      n
3       1  34.00   13.50    91.0   161  24-35 mo      n
4       1  18.79   12.40    83.8   160  12-23 mo      n
5       2  24.00    8.10    76.8   122  24-35 mo      n
6       1   6.00   12.50    65.0   156  06-11 mo      y
7       2  11.00    8.10    71.5   151  06-11 mo      n
8       2  42.00   10.60    89.5   149  36-47 mo      y
9       1  45.00   11.80    87.1   141  36-47 mo      n
10      1  23.00   12.80    78.1   160  12-23 mo      y
11      2  30.00   11.30    81.7   153  24-35 mo      n
12      1  18.00   12.10    81.5   152  12-23 mo      y
13      1   7.03    8.10    70.9   142  06-11 mo      n
14      1   4.99    6.80    67.1   135  00-05 mo      y
15      2  30.00   11.80    89.3   142  24-35 mo      n
16      2  36.00   12.50    93.0   166  36-47 mo
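
Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. As an alternative sketch, `pd.option_context` applies a setting only inside a `with` block. The `numbers` DataFrame and the limit of 200 are arbitrary choices for illustration.

```python
import pandas as pd

# A hypothetical DataFrame with 100 rows.
numbers = pd.DataFrame({"n": range(100)})

default_limit = pd.get_option("display.max_rows")

# Inside the with block, up to 200 rows are printed in full.
with pd.option_context("display.max_rows", 200):
    full_text = repr(numbers)

# Outside the block the previous setting applies again.
restored_limit = pd.get_option("display.max_rows")

print(default_limit, restored_limit)
```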

## Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method.

The head() method returns the headers and a specified number of rows, starting from the top.

In [20]:
df = pd.read_csv('oedema.csv')

print(df.head(10))

   sex    age  weight  height  muac age_group oedema
0    1  18.83    10.3    78.2   148  12-23 mo      n
1    1  35.02    12.6    90.5   166  24-35 mo      n
2    1  48.00    16.4   101.6   164  48-59 mo      n
3    1  34.00    13.5    91.0   161  24-35 mo      n
4    1  18.79    12.4    83.8   160  12-23 mo      n
5    2  24.00     8.1    76.8   122  24-35 mo      n
6    1   6.00    12.5    65.0   156  06-11 mo      y
7    2  11.00     8.1    71.5   151  06-11 mo      n
8    2  42.00    10.6    89.5   149  36-47 mo      y
9    1  45.00    11.8    87.1   141  36-47 mo      n


Note: If the number of rows is not specified, the head() method will return the top 5 rows.

In [21]:
df = pd.read_csv('oedema.csv')

print(df.head())

   sex    age  weight  height  muac age_group oedema
0    1  18.83    10.3    78.2   148  12-23 mo      n
1    1  35.02    12.6    90.5   166  24-35 mo      n
2    1  48.00    16.4   101.6   164  48-59 mo      n
3    1  34.00    13.5    91.0   161  24-35 mo      n
4    1  18.79    12.4    83.8   160  12-23 mo      n


There is also a tail() method for viewing the last rows of the DataFrame.

The tail() method returns the headers and a specified number of rows, starting from the bottom.

In [22]:
print(df.tail())

      sex   age  weight  height  muac age_group oedema
1175    2  34.0    12.1    79.9   142  24-35 mo      n
1176    1  37.0    13.9    94.5   145  36-47 mo      n
1177    1  22.0    14.0    78.0   152  12-23 mo      n
1178    2  12.0     8.0    74.1   133  12-23 mo      n
1179    1  18.0    10.1    75.0   153  12-23 mo      n
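
Both methods accept the number of rows as an argument. A minimal sketch on a small made-up DataFrame (the `letters` name and its contents are assumptions for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with eight rows, indexed 0 to 7.
letters = pd.DataFrame({"letter": list("abcdefgh")})

first_three = letters.head(3)   # rows 0, 1, 2
last_two = letters.tail(2)      # rows 6, 7
default_head = letters.head()   # the first 5 rows

print(first_three)
print(last_two)
print(default_head)
```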


## Info About the Data
The DataFrame object has a method called info() that gives you more information about the data set.

In [23]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1180 entries, 0 to 1179
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   sex        1180 non-null   int64  
 1   age        1180 non-null   float64
 2   weight     1180 non-null   float64
 3   height     1180 non-null   float64
 4   muac       1180 non-null   int64  
 5   age_group  1180 non-null   object 
 6   oedema     1180 non-null   object 
dtypes: float64(3), int64(2), object(2)
memory usage: 64.7+ KB
None
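
The trailing `None` in the output above appears because info() prints its report itself and returns None, which print() then displays. The sketch below calls info() directly, captures the report in a text buffer through the `buf` parameter, and counts missing values per column. The `children` DataFrame is a hypothetical example with one empty value.

```python
import io

import pandas as pd

# Hypothetical data with one missing value in the "weight" column.
children = pd.DataFrame({
    "age": [18.83, 35.02, 48.0],
    "weight": [10.3, None, 16.4],
})

# info() prints its report and returns None, so call it directly.
children.info()

# Passing a text buffer captures the report as a string instead.
buffer = io.StringIO()
children.info(buf=buffer)
report = buffer.getvalue()

# Count the missing values in each column.
missing = children.isna().sum()
print(missing)
```

The Non-Null Count column of the report and the `isna().sum()` result describe the same thing from opposite directions: how many values are present versus how many are missing.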
