### Introduction

#### What is Pandas?
- Pandas is a Python library used for working with data sets.
- It has functions for analyzing, cleaning, exploring, and manipulating data.
- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

#### Why use Pandas?
- Pandas allows us to analyze big data and draw conclusions based on statistical theories.
- Pandas can clean messy datasets, and make them readable and relevant.
- Relevant data is very important in Data Science.

#### What can Pandas do?
Pandas gives you answers about data, such as:
- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
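
As a rough illustration of how these questions map onto DataFrame methods, the sketch below uses a small made-up table (the column names and values are assumptions, not data from this notebook) and calls `corr()`, `mean()`, `max()`, and `min()`.

```python
import pandas

# Hypothetical workout data, made up for illustration only.
df = pandas.DataFrame({
    "duration": [30, 45, 60, 75],
    "calories": [200, 300, 400, 500],
})

# Pearson correlation between every pair of numeric columns.
correlation = df.corr()

# Column-wise average, maximum, and minimum.
average = df.mean()
maximum = df.max()
minimum = df.min()

print(correlation)
print(average)
print(maximum)
print(minimum)
```

Each call returns a new object (a DataFrame for `corr()`, a Series for the others), so the original data is left untouched.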

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.

#### Where is the Pandas Codebase?
The source code for Pandas is located in this [GitHub repository](https://github.com/pandas-dev/pandas).

### Getting Started

#### Installation of Pandas
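
Pandas is usually installed from PyPI with `pip install pandas`, or it comes bundled with a distribution such as Anaconda. The snippet below is a small sketch, using only the standard library, of one way to check whether the package is visible to the current Python interpreter.

```python
import importlib.util

# find_spec returns None when the package cannot be found.
pandas_spec = importlib.util.find_spec("pandas")
pandas_installed = pandas_spec is not None

print("pandas is installed" if pandas_installed else "pandas is not installed")
```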

#### Import Pandas

In [1]:
import pandas

Example:

In [2]:
myDataset = {
    'cars' : ['BMW', 'Volvo', 'Ford'],
    'passings' : [3, 7, 2]
}
myVar = pandas.DataFrame(myDataset)
print(myVar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


#### Checking Pandas Version

In [3]:
print(pandas.__version__)

2.2.3


### Series

#### What is a Series?
A Pandas Series is like a column in a table.<br>

It is a one-dimensional array holding data of any type. <br>

Example:

In [4]:
a = [1, 7, 2]
myVar = pandas.Series(a)
print(myVar)

0    1
1    7
2    2
dtype: int64


#### Labels
If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on. <br>

This label can be used to access a specified value.

Example <br>

    - Return the first value of the Series:

In [5]:
print(myVar[0])

1


#### Create Labels
With the index argument, you can name your own labels.

Example <br>
    
    - Create your own labels:

In [6]:
myVar = pandas.Series(a, index = ['x', 'y', 'z'])

When you have created labels, you can access an item by referring to the label.

Example

    - Return the value of 'y'

In [7]:
print(myVar['y'])

7
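
Once a Series has custom labels, it can help to be explicit about whether a lookup is by label or by position. The following is a minimal sketch, reusing the values from the examples above, that contrasts `.loc` (label-based) with `.iloc` (position-based).

```python
import pandas

a = [1, 7, 2]
myVar = pandas.Series(a, index=['x', 'y', 'z'])

# .loc looks a value up by its label.
by_label = myVar.loc['y']

# .iloc looks a value up by its integer position (0-based).
by_position = myVar.iloc[1]

print(by_label, by_position)
```

Both lookups point at the same element here, because 'y' is the label of the second position.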


#### Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

Example

    - Create a simple Pandas Series from a dictionary:

In [8]:
calories = {
    'day1' : 420,
    'day2' : 380,
    'day3' : 390
}
myVar = pandas.Series(calories)
print(myVar)

day1    420
day2    380
day3    390
dtype: int64


To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

Example

    - Create a Series using only data from "day1" and "day2":

In [9]:
myVar = pandas.Series(calories, index = ['day1', 'day2'])
print(myVar)

day1    420
day2    380
dtype: int64
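
One detail worth knowing: if the index argument names a key that is not in the dictionary, Pandas does not raise an error but fills that entry with a missing value (NaN). The sketch below illustrates this with a hypothetical `'day4'` label that does not exist in `calories`.

```python
import pandas

calories = {'day1': 420, 'day2': 380, 'day3': 390}

# 'day4' is not a key in the dictionary, so its value is missing (NaN).
myVar = pandas.Series(calories, index=['day1', 'day4'])
print(myVar)
```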


#### DataFrames
Datasets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

Example

    - Create a DataFrame from two lists, each of which becomes a column (a Series):

In [10]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
myVar = pandas.DataFrame(data)
print(myVar)

   calories  duration
0       420        50
1       380        40
2       390        45
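
The example above passes plain lists in the dictionary. As a sketch of the same table built from actual Series objects (the variable names `calories` and `duration` are chosen here for illustration), each Series becomes one column:

```python
import pandas

calories = pandas.Series([420, 380, 390], name="calories")
duration = pandas.Series([50, 40, 45], name="duration")

# Each Series becomes one column; rows are aligned on the shared index.
myVar = pandas.DataFrame({"calories": calories, "duration": duration})
print(myVar)
```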


### DataFrames


#### What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

Example

    - Create a simple Pandas DataFrame:

In [11]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pandas.DataFrame(data)
print(df)

   calories  duration
0       420        50
1       380        40
2       390        45


#### Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

Example

    - Return row 0:

In [12]:
print(df.loc[0])

calories    420
duration     50
Name: 0, dtype: int64


    - Return row 0 and 1

In [13]:
# Equivalent to df.loc[[0, 1]]: loc accepts any list-like of row labels.
print(df.loc[range(2)])

   calories  duration
0       420        50
1       380        40
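
The two examples above return different kinds of objects, which is easy to miss in printed output. The sketch below, using the same data, shows that a single label yields a Series while a list of labels yields a DataFrame. It also shows a label slice, which (unlike ordinary Python slicing) includes both endpoints.

```python
import pandas

df = pandas.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

single_row = df.loc[0]      # one label -> a Series
row_list = df.loc[[0, 1]]   # a list of labels -> a DataFrame
row_slice = df.loc[0:1]     # a label slice includes BOTH endpoints

print(type(single_row).__name__)
print(type(row_list).__name__)
print(row_slice.index.tolist())
```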


#### Name Indexes
With the index argument, you can name your own indexes.

Example

    - Add a list of names to give each row a name:

In [14]:
df = pandas.DataFrame(data, index = ['day1', 'day2', 'day3'])
print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45


#### Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s)

Example

    - Return 'day2':

In [15]:
print(df.loc['day2'])

calories    380
duration     40
Name: day2, dtype: int64


#### Load Files Into a DataFrame
If your datasets are stored in a file, Pandas can load them into a DataFrame.

Example

    - Load a comma separated file (CSV file) into a DataFrame:

In [16]:
df = pandas.read_csv('test_data/data.csv')
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
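
The example above depends on the file `test_data/data.csv` being present. For experimenting without that file, here is a sketch that feeds `read_csv` an in-memory string through `io.StringIO`. The CSV text is made up for illustration (its last row deliberately leaves Calories empty) and is not the notebook's dataset.

```python
import io
import pandas

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,
"""

# read_csv accepts a path or any file-like object, such as StringIO.
df = pandas.read_csv(io.StringIO(csv_text))
print(df)
```

The empty field in the last row is read as a missing value (NaN).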


### Read CSV

#### Read CSV Files
A simple way to store big data sets is to use CSV (comma-separated values) files.

CSV files contain plain text in a well-known format that can be read by virtually any tool, including Pandas.

Example

    - Load the CSV file into a DataFrame:

In [17]:
df = pandas.read_csv('test_data/data.csv')
print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

**Tip:** Use to_string() to print the entire DataFrame. <br>
If you have a large DataFrame with many rows, Pandas will only display the first 5 rows and the last 5 rows.

Example

    - Print the DataFrame without the to_string() method:

In [18]:
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


#### max_rows
The number of rows displayed is defined in the Pandas option settings. <br>

You can check your system's maximum number of rows with the pandas.options.display.max_rows setting.

Example 

    - Check the number of maximum returned rows:

In [19]:
print(pandas.options.display.max_rows)

60


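You can change the maximum number of rows with the same setting. A DataFrame with more rows than this limit is printed in truncated form.

Example

    - Set the maximum number of rows to 10; the 169-row DataFrame exceeds the limit, so the output is still truncated: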

In [20]:
pandas.options.display.max_rows = 10

print(pandas.read_csv('test_data/data.csv'))

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
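
Assigning to `pandas.options.display.max_rows` changes the setting for the rest of the session. If the change is only needed for one printout, `pandas.option_context` can apply it temporarily. The sketch below uses a made-up 100-row DataFrame (not the notebook's dataset) and assumes that setting `display.max_rows` to `None` removes the row limit.

```python
import pandas

# A made-up DataFrame with 100 rows, longer than the default display limit.
df = pandas.DataFrame({"value": range(100)})

# Inside the block, max_rows=None means "no limit"; the previous
# setting is restored automatically when the block exits.
before = pandas.get_option("display.max_rows")
with pandas.option_context("display.max_rows", None):
    full_text = repr(df)
after = pandas.get_option("display.max_rows")

print(len(full_text.splitlines()))
print(before == after)
```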


### Read JSON


#### Read JSON
Big datasets are often stored or extracted as JSON. <br>

JSON is plain text, but has the format of an object. It is well known in the world of programming, and Pandas can read it directly. <br>

Example

    - Load the JSON file into a DataFrame:

In [21]:
df = pandas.read_json('test_data/data.json')
print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   
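
To try `read_json` without the file `test_data/data.json`, the JSON text can be wrapped in `io.StringIO`. The sketch below uses a short, made-up JSON string in the same column-oriented layout as the dictionary example in the next section. It is an illustration, not the notebook's dataset.

```python
import io
import pandas

# Hypothetical JSON text in the column-oriented layout: {column: {row label: value}}.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

# Wrapping the text in StringIO lets read_json treat it like an open file.
df = pandas.read_json(io.StringIO(json_text))
print(df)
```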

#### Dictionary as JSON

**JSON = Python Dictionary** <br>
JSON objects have a format very similar to Python dictionaries.

If your JSON code is not in a file, but in a Python Dictionary, you can load it into a DataFrame directly.

Example

    - Load a Python Dictionary into a DataFrame:

In [22]:
data = {
  "Duration":{
    "0":60,
    "1":60,
    "2":60,
    "3":45,
    "4":45,
    "5":60
  },
  "Pulse":{
    "0":110,
    "1":117,
    "2":103,
    "3":109,
    "4":117,
    "5":102
  },
  "Maxpulse":{
    "0":130,
    "1":145,
    "2":135,
    "3":175,
    "4":148,
    "5":127
  },
  "Calories":{
    "0":409,
    "1":479,
    "2":340,
    "3":282,
    "4":406,
    "5":300
  }
}

df = pandas.DataFrame(data)
print(df)

   Duration  Pulse  Maxpulse  Calories
0        60    110       130       409
1        60    117       145       479
2        60    103       135       340
3        45    109       175       282
4        45    117       148       406
5        60    102       127       300
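
When the JSON arrives as text rather than as a ready-made dictionary, the standard library's `json.loads` can convert it first. The sketch below uses a made-up JSON string. It also shows that the row labels built this way are strings, taken from the inner keys, and one way to convert them to integers if that is preferred.

```python
import json
import pandas

# Hypothetical JSON text; json.loads turns it into a Python dictionary.
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'
data = json.loads(json_text)

df = pandas.DataFrame(data)

# The row labels come from the inner keys, so here they are strings.
labels_are_strings = all(isinstance(label, str) for label in df.index)

# Convert them to integers if numeric row labels are preferred.
df.index = df.index.astype(int)
print(df)
```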


### Analyzing Data

#### Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame is the head() method. <br>

The head() method returns the headers and a specified number of rows, starting from the top. <br>

Example

    - Get a quick overview by printing the first 10 rows of the DataFrame:

In [23]:
df = pandas.read_csv('test_data/data.csv')
print(df.head(10))

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.5
6        60    110       136     374.0
7        45    104       134     253.3
8        30    109       133     195.1
9        60     98       124     269.0


**Note:** if the number of rows is not specified, the head() method will return the top 5 rows.

Example:

    - Print the first 5 rows of the DataFrame:

In [24]:
print(df.head())

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


There is also a tail() method for viewing the last rows of the DataFrame. <br>

The tail() method returns the headers and a specified number of rows, starting from the bottom. <br>

Example

    - Print the last 5 rows of the DataFrame:

In [25]:
print(df.tail())

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
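
Both methods accept the number of rows as an argument. As a quick sketch on a made-up ten-row DataFrame (not the notebook's dataset), `head(3)` keeps the first three rows and `tail(2)` keeps the last two:

```python
import pandas

# A made-up DataFrame with ten rows, indexed 0 through 9.
df = pandas.DataFrame({"value": range(10)})

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9

print(first_three)
print(last_two)
```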


#### Info About the Data
The DataFrame object has a method called info() that gives you more information about the data set.

Example

    - Print information about the data:

In [26]:
# info() prints its report itself and returns None, so print() adds a trailing "None".
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


#### Null Values
The info() method also tells us how many non-null values are present in each column. In our data set, 164 of the 169 values in the "Calories" column are non-null. <br>

This means that 5 rows have no value at all in the "Calories" column, for whatever reason. <br>

Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called *cleaning data*.
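
As a first step in that direction, missing values can be counted with `isna().sum()` and the affected rows dropped with `dropna()`. The sketch below uses a small made-up DataFrame with one missing "Calories" value rather than the notebook's dataset. Whether dropping rows is the right choice depends on the analysis.

```python
import pandas

# Made-up data with one missing Calories value (None becomes NaN).
df = pandas.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, None, 340.0],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# dropna() returns a new DataFrame without the incomplete rows;
# the original df is left unchanged.
cleaned = df.dropna()

print(missing_per_column)
print(cleaned)
```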