### What is Pandas?

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

### Why Use Pandas?

Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

### What Can Pandas Do?

Pandas can answer questions about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
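
The questions and the cleaning step described above map onto a handful of DataFrame methods. The following is a minimal sketch, not a full workflow: the `sample` data is hypothetical, and `dropna()` is only one of several ways to handle missing values.

```python
import pandas as pd

# Hypothetical sample data; the None in "duration" stands in for a missing value.
sample = pd.DataFrame({
    "duration": [60, 45, None, 30],
    "calories": [400, 300, 350, 200],
})

# Cleaning: drop every row that contains a missing value.
cleaned = sample.dropna()

# Summary statistics for one column.
mean_calories = cleaned["calories"].mean()
max_calories = cleaned["calories"].max()
min_calories = cleaned["calories"].min()

# Pairwise Pearson correlation between the numeric columns.
correlation = cleaned.corr()

print(cleaned)
print(mean_calories, max_calories, min_calories)
print(correlation)
```

Dropping rows is the simplest cleaning strategy. Depending on the data, filling in missing values may be a better fit.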



### Getting Started

In [7]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [8]:
import pandas

### Pandas as pd

Pandas is usually imported under the pd alias.

In [9]:
import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.

In [11]:
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


### Checking the Library Version

The version string is stored in the __version__ attribute.

In [12]:
print(pd.__version__)

1.4.4


## Pandas Series

### What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [14]:
a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


### Labels

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

#### Return the first value of the Series:

In [15]:
print(myvar[0])

1


#### Create your own labels:

In [16]:
a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

x    1
y    7
z    2
dtype: int64


When you have created labels, you can access an item by referring to the label.

In [17]:
print(myvar["y"])

7
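
Square brackets are not the only way to read a value. A Series also exposes `loc` for label-based lookup and `iloc` for position-based lookup, which makes the intent explicit when the labels are not the default integers. A small sketch, using a hypothetical variable name `labeled`:

```python
import pandas as pd

labeled = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = labeled.loc["y"]     # label-based lookup
by_position = labeled.iloc[1]   # position-based lookup (second item)

print(by_label, by_position)
```

Both lookups return the same value here because "y" is the label of the second item.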


### Key/Value Objects as Series

You can also use a key/value object, like a dictionary, when creating a Series.

In [18]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


Note: The keys of the dictionary become the labels.

Create a Series using only data from "day1" and "day2":

In [21]:
myvar1 = pd.Series(calories, index = ["day1", "day2"])

print(myvar1)

day1    420
day2    380
dtype: int64
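
The index argument selects by dictionary key, so it helps to know what happens when a requested label is not a key. In the sketch below, "day4" is a hypothetical label that does not exist in the dictionary. Pandas keeps the label and fills its value with NaN.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
partial = pd.Series(calories, index=["day1", "day4"])

print(partial)

# isna() flags the entries that have no value.
print(partial.isna())
```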


### DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

Create a DataFrame from two Series:
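
One way to do this, sketched with the same values used in the other examples in this section, is to pass a dictionary of Series to `pd.DataFrame`. The variable names `calories`, `duration`, and `frame` are chosen here for illustration.

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each dictionary key becomes a column name; the Series are aligned on their index.
frame = pd.DataFrame({"calories": calories, "duration": duration})

print(frame)
```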

### What is a DataFrame?

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [25]:
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

df

   calories  duration
0       420        50
1       380        40
2       390        45


### Locate Row

As you can see from the result above, the DataFrame is like a table with rows and columns.

Pandas uses the loc attribute to return one or more specified rows.

In [27]:
print(df.loc[0])
print(type(df.loc[0]))

calories    420
duration     50
Name: 0, dtype: int64
<class 'pandas.core.series.Series'>


Note: This example returns a Pandas Series.

In [31]:
# Use a list of index labels:
print(df.loc[[1,2]])
print(type(df.loc[[1,2]]))

   calories  duration
1       380        40
2       390        45
<class 'pandas.core.frame.DataFrame'>


Note: When a list of labels is passed to loc, the result is a Pandas DataFrame.

In [35]:
df.loc[1,2]

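KeyError: 2

Note: Without the inner brackets, loc reads 1 as a row label and 2 as a column label. The DataFrame has no column labeled 2, so Pandas raises a KeyError.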
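
The two-argument form of loc does work when the second argument is an existing column label. A minimal sketch that rebuilds the same DataFrame so it runs on its own; `single_value` and `subset` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# Row label 1, column label "duration": returns a single value.
single_value = df.loc[1, "duration"]

# A list of row labels plus a list of column labels returns a smaller DataFrame.
subset = df.loc[[1, 2], ["calories"]]

print(single_value)
print(subset)
```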

### Named Indexes

With the index argument, you can name your own indexes.

In [36]:
df1 = pd.DataFrame(data, index = ["day1", "day2", "day3"])
df1

      calories  duration
day1       420        50
day2       380        40
day3       390        45


### Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).

In [38]:
print(df1.loc["day2"])

calories    380
duration     40
Name: day2, dtype: int64
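
Named indexes work with lists of labels as well, and rows can still be reached by position through `iloc`. The sketch below rebuilds `df1` so it is self-contained; `two_days` and `second_row` are illustrative names.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df1 = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A list of labels returns a DataFrame with those rows.
two_days = df1.loc[["day1", "day3"]]

# iloc selects by integer position, regardless of the labels.
second_row = df1.iloc[1]

print(two_days)
print(second_row)
```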


### Load Files Into a DataFrame

In [40]:
df = pd.read_csv('data.csv')

print(df) 

   calories  duration
0       420        50
1       380        40
2       390        45
3       300        60
4       430        35
5       210        45
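
The example above depends on a data.csv file being present in the working directory. For a version that runs without any file, `read_csv` also accepts a file-like object. The sketch below feeds it a hypothetical in-memory CSV string, assuming a layout with a header row and comma separators.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the contents of a file such as data.csv.
csv_text = """calories,duration
420,50
380,40
390,45
"""

# StringIO wraps the string in a file-like object, so no file on disk is needed.
loaded = pd.read_csv(io.StringIO(csv_text))

print(loaded.head(2))  # first two rows
print(loaded.shape)    # (number of rows, number of columns)
```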
