# DataFrames, Series and Lists

Before we can do anything with Pandas, we have to import it, so that Python and Jupyter Notebook know how to use it.
We will also load a CSV file with some data into the variable `testData`.

In [5]:
import pandas as pd
testData = pd.read_csv("./data/startdata.csv", sep=';', index_col=['Date and time'], parse_dates=['Date and time'], dayfirst=True)
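
To see what these `read_csv` options do without needing the original file, here is a minimal sketch that reads a small in-memory CSV instead. The sample rows and the names `csv_text` and `sample` are made up for illustration; they are not the contents of `startdata.csv`.

```python
import io
import pandas as pd

# Hypothetical sample in a similar layout: ';' separated, day-first dates.
csv_text = """Date and time;Value;Value2
01/02/2019 00:00;10;20
02/02/2019 00:00;11;21
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                        # columns are separated by semicolons
    index_col='Date and time',      # use this column as the index
    parse_dates=['Date and time'],  # convert it to datetime values
    dayfirst=True,                  # read 01/02/2019 as 1 February, not 2 January
)
print(sample.index)
```

Without `dayfirst=True`, an ambiguous date such as `01/02/2019` would be read month-first.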

<br>While working with Pandas you will be using **DataFrames, Series and Lists** most of the time. The first two come from Pandas, the List is the default Python one.

Basically, a DataFrame has an index and multiple columns, a Series has an index and one column, and a List is just an array of values.
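
As a rough illustration of the three types, the sketch below builds one of each from the same values. The variable names and numbers are only examples.

```python
import pandas as pd

values = [85, 76, 99]                      # a plain Python list
series = pd.Series(values, name='Value')   # an index plus one column of values
frame = pd.DataFrame({'Value': values,     # an index plus several columns
                      'Value2': [48, 64, 62]})

print(type(values).__name__, type(series).__name__, type(frame).__name__)
print(series.tolist())   # a Series can be turned back into a plain list
```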

When working with DataFrames, you will often create new DataFrames or Series containing a limited subset of the original. One way to do this is by specifying which columns you want to use from the original. 

The format is slightly different depending on whether you want a new DataFrame or a Series:
```python
newDataFrame = OriginalDataFrame[['columnName']]                  # one column
newDataFrame = OriginalDataFrame[['columnName', 'columnName2']]   # several columns
newSeries = OriginalDataFrame['columnName']
```
The difference is that for the DataFrame you use double square brackets "[[ ]]" and for Series single square brackets "[ ]" to indicate which columns to use.
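
The effect of the two bracket styles can be checked directly. This is a small sketch on a made-up DataFrame (`df_demo` is a hypothetical name, not the tutorial's data):

```python
import pandas as pd

df_demo = pd.DataFrame({'Value': [1, 2, 3], 'Value2': [4, 5, 6]})

as_frame = df_demo[['Value']]   # list of column names -> DataFrame
as_series = df_demo['Value']    # single column name   -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The DataFrame keeps two dimensions even with a single column, while the Series has only one.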

When executing in Jupyter Notebook, you can see the difference between a DataFrame and a Series: the DataFrame's output is shown as a formatted HTML table, whereas the Series is shown as plain text without formatting.

In [7]:
testData[['Value']].head(3)

| Date and time | Value |
|---|---|
| 1/01/2019 0:00 | 85 |
| 1/01/2019 1:00 | 76 |
| 1/01/2019 2:00 | 99 |


In [3]:
testData['Value'].head(3)

Date and time
2019-01-01 00:00:00    30
2019-01-01 06:00:00    73
2019-01-01 12:00:00    44
Name: Value, dtype: int64

If the name of a column, like the one above, is a valid Python identifier (no spaces, for example) and does not clash with an existing DataFrame attribute or method, we can also access the column as an attribute, without the ['Value']:
```python
testData.Value.head(3)
```
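
Attribute access has limits worth knowing. The sketch below uses a made-up DataFrame with one column name containing spaces and one, `shape`, that collides with a built-in DataFrame attribute; both column names are chosen purely for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'Value': [1, 2],
                        'Date and time': ['a', 'b'],
                        'shape': [7, 8]})

print(df_demo.Value.tolist())              # same result as df_demo['Value']
print(df_demo['Date and time'].tolist())   # spaces in the name: brackets are required
print(df_demo.shape)                       # the DataFrame attribute, not the column
print(df_demo['shape'].tolist())           # brackets always reach the column
```

Bracket access works for every column name, so it is the safer default.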

<br>If you want a new DataFrame with **multiple columns**, pass the column names in the order you want them to appear:

In [12]:
testData[['Value2', 'Value']].head(3)

| Date and time | Value2 | Value |
|---|---|---|
| 1/01/2019 0:00 | 48 | 85 |
| 1/01/2019 1:00 | 64 | 76 |
| 1/01/2019 2:00 | 62 | 99 |


Another way to do this is the **.iloc** indexer. It takes a start and an end offset for both rows and columns (the end offset itself is not included). That slice is cut out of the DataFrame and returned as a new one, in the order in which the data was in the original DataFrame. *iloc cannot use the names of columns, it needs integer positions*.

In [8]:
testData.iloc[0:5, 1:3]

| Date and time | Value | Value2 |
|---|---|---|
| 2019-01-01 00:00:00 | 30 | 42 |
| 2019-01-01 06:00:00 | 73 | 37 |
| 2019-01-01 12:00:00 | 44 | 51 |
| 2019-01-01 18:00:00 | 61 | 8 |
| 2019-01-02 00:00:00 | 26 | 1 |
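
Position-based slicing with `.iloc` is easy to confuse with label-based slicing with `.loc`. The following sketch compares the two on a small made-up DataFrame (the names `df_demo`, `by_position` and `by_label` are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]},
    index=['w', 'x', 'y', 'z'],
)

# .iloc: integer positions, end offset excluded -> rows 0-1, columns 1-2
by_position = df_demo.iloc[0:2, 1:3]

# .loc: labels, end label included -> rows 'w'-'x', columns 'B'-'C'
by_label = df_demo.loc['w':'x', 'B':'C']

print(by_position)
print(by_position.equals(by_label))
```

Both expressions select the same block here; they differ only in how the rows and columns are addressed.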


<br>To create a DataFrame from raw data, pass in the data and tell the constructor in which order to place the columns:

In [7]:
raw_data = {'two':   [2, 2, 2, 2],
            'one':   [1, 1, 1, 1],
            'three': [3, 3, 3, 3]}
df = pd.DataFrame(raw_data, columns = ['one', 'two', 'three'])
df

|  | one | two | three |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 |
| 2 | 1 | 2 | 3 |
| 3 | 1 | 2 | 3 |


This DataFrame has an index set automatically. To use one of the columns as the index instead, use the **.set_index()** method, which returns a new DataFrame:

In [8]:
df.set_index('one')

| one | two | three |
|---|---|---|
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |


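By default, the column that becomes the index is removed from the regular columns; to keep it as a column as well, add the option **drop=False**. If we wanted to add the new index to the one that was already in place, instead of replacing it, we would need to add the option **append=True**.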
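
The `drop` and `append` options of `set_index()` are easy to mix up, so here is a short sketch of what each one does. The DataFrame is a shortened, made-up variant of the one above, and the result names `kept` and `stacked` are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1, 1], 'two': [2, 2], 'three': [3, 3]})

# drop=False: 'one' becomes the index and also stays a regular column
kept = df_demo.set_index('one', drop=False)

# append=True: 'one' is added as an extra level next to the existing index
stacked = df_demo.set_index('one', append=True)

print(list(kept.columns))
print(stacked.index.nlevels)
print(list(df_demo.columns))   # the original DataFrame is left unchanged
```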

## Info about the DataFrame

To know the **size** of a DataFrame we can look at its **shape** property, which returns a tuple with the number of rows and the number of columns in the DataFrame:

In [10]:
testData.shape

(48, 4)
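
Because `shape` is an ordinary tuple, it can be unpacked into separate variables. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df_demo.shape   # unpack (rows, columns)
print(n_rows, n_cols)
print(len(df_demo))              # number of rows only
print(df_demo.size)              # total number of cells: rows * columns
```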

## Concat

Concat allows us to concatenate multiple DataFrames together.
```python
out = pd.concat([FirstDF, SecondDF, ThirdDF])  # any number of DataFrames
```
They will be added below each other. To add them next to each other we would use the option axis=1, but this is explained in the next part.
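
A small sketch of stacking two made-up DataFrames below each other. The `ignore_index=True` option is not mentioned above; it is shown here only because the original index labels otherwise repeat in the result.

```python
import pandas as pd

first = pd.DataFrame({'Value': [1, 2]})
second = pd.DataFrame({'Value': [3, 4]})

stacked = pd.concat([first, second])                         # index labels repeat
renumbered = pd.concat([first, second], ignore_index=True)   # fresh index from 0

print(list(stacked.index))
print(list(renumbered.index))
```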

## Merge

To combine two DataFrames based on a field they have in common, use **merge**:
```python
pd.merge(FirstDF, SecondDF, on='CommonColumn')
```
There is also the option to specify how to do the merge by adding the *how* option and setting it to, for example, *inner* (the default) or *outer*.
```python
pd.merge(FirstDF, SecondDF, on='CommonColumn', how='inner')
```
This merges the data based on CommonColumn and does an inner join: only CommonColumn keys that appear in both DataFrames are kept; keys that appear in only one of them are dropped.
```python
pd.merge(FirstDF, SecondDF, on='CommonColumn', how='outer')
```
This merges the data based on CommonColumn and does an outer join: every CommonColumn key found in either DataFrame appears in the result. Where a key exists in only one DataFrame, the columns coming from the other DataFrame are filled with NaN.
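
The difference between the two joins is easiest to see on a tiny example. In this sketch the DataFrames and the `left_val` / `right_val` columns are made up; only `CommonColumn` follows the naming used above.

```python
import pandas as pd

first = pd.DataFrame({'CommonColumn': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
second = pd.DataFrame({'CommonColumn': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

inner = pd.merge(first, second, on='CommonColumn', how='inner')
outer = pd.merge(first, second, on='CommonColumn', how='outer')

print(inner)   # only keys 2 and 3, which exist in both DataFrames
print(outer)   # keys 1 to 4; missing values are NaN
```

Key 1 exists only in `first` and key 4 only in `second`, so the outer join contains one NaN in `right_val` and one in `left_val`.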

Next: [Adding columns to a DataFrame](03-Adding_columns_to_a_DataFrame.ipynb) | [Content](00-Content.ipynb)