# Pandas

In [3]:
import pandas as pd

Pandas has two main data structures:
- `Series` - can be thought of as a single column of values
- `DataFrame` - a collection of `Series` objects

## Series

- are built on NumPy arrays
- one Series object stores only one data type, like a NumPy array
- a Series has an index and values
    - index - by default it is a `RangeIndex`, which numbers the elements starting from 0, like a Python list. Other index types are possible, for example strings (like keys in a Python dictionary). Index labels can be passed together with the values when creating a Series object
    - values - can be anything, but must be of the same type throughout the Series object
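
The single-data-type rule can be checked with a small sketch. The variable names here are illustrative and not part of the notebook; the example only assumes standard pandas type inference.

```python
import pandas as pd

# Integers and floats are upcast to one common numeric dtype.
mixed_numbers = pd.Series([1, 2, 3.5])
print(mixed_numbers.dtype)  # float64

# Numbers and strings share no numeric dtype, so pandas falls back to `object`.
mixed_values = pd.Series([1, 'two', 3.0])
print(mixed_values.dtype)  # object

# Index labels can be supplied together with the values.
ages = pd.Series([18, 30], index=['Piotr', 'Tom'])
print(ages['Tom'])  # 30
```

An `object` Series still holds one dtype; each element is simply a reference to an arbitrary Python object.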

In [9]:
names = pd.Series(['Piotr', 'Tom', 'Chris', 'Ann', 'Paul'])
names, names.index

(0    Piotr
 1      Tom
 2    Chris
 3      Ann
 4     Paul
 dtype: object,
 RangeIndex(start=0, stop=5, step=1))

In [13]:
names[0]

'Piotr'

In [15]:
names[0:2]

0    Piotr
1      Tom
dtype: object

In [17]:
names[::2]

0    Piotr
2    Chris
4     Paul
dtype: object

In [19]:
names[::-1]

4     Paul
3      Ann
2    Chris
1      Tom
0    Piotr
dtype: object

In [21]:
# https://pandas.pydata.org/docs/reference/api/pandas.Series.html
countries = pd.Series(['Poland', 'United States', 'Germany', 'France'], index=['PL', 'US', 'DE', 'FR'])
countries

PL           Poland
US    United States
DE          Germany
FR           France
dtype: object

In [22]:
countries.index

Index(['PL', 'US', 'DE', 'FR'], dtype='object')

In [24]:
countries.values

array(['Poland', 'United States', 'Germany', 'France'], dtype=object)

In [26]:
type(countries.values)

numpy.ndarray

In [29]:
countries.iloc[0], countries['PL']  # .iloc selects by position; plain [0] on a string index is deprecated in recent pandas

('Poland', 'Poland')
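
When the index holds labels rather than default integers, it helps to say explicitly whether a lookup is by label or by position. A minimal sketch using the `.loc` and `.iloc` accessors on the same `countries` Series (the `by_label` and `by_position` names are illustrative):

```python
import pandas as pd

countries = pd.Series(
    ['Poland', 'United States', 'Germany', 'France'],
    index=['PL', 'US', 'DE', 'FR'],
)

by_label = countries.loc['DE']     # label-based lookup
by_position = countries.iloc[2]    # position-based lookup, counted from 0
print(by_label, by_position)

# Label slices include the end label; positional slices exclude the end position.
print(list(countries.loc['PL':'DE']))  # three elements
print(list(countries.iloc[0:2]))       # two elements
```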

## DataFrame

- it's a collection of Series objects connected together by a shared index
- it's a two-dimensional array, which consists of columns and rows
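
The shared index can be illustrated by building a DataFrame from two Series. This is only a sketch; the `ages`, `genders` and `people` names are made up for the example.

```python
import pandas as pd

ages = pd.Series([18, 30, 25], index=['Piotr', 'Tom', 'Ann'])
# Same labels, deliberately listed in a different order.
genders = pd.Series(['M', 'F', 'M'], index=['Tom', 'Ann', 'Piotr'])

# Values are matched by index label, not by position.
people = pd.DataFrame({'age': ages, 'gender': genders})
print(people)

# Selecting one column gives back a Series that uses the DataFrame's index.
column = people['age']
print(type(column).__name__)
print(column.index.equals(people.index))
```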

There are several ways to create a DataFrame:
- we can use existing Python collections
- we can read files in different formats (xlsx, csv, JSON, etc.)
    - local files
    - URLs
- we can read the data from databases (MySQL, Oracle, etc.)
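
As a rough sketch of the file and database options, the example below reads CSV text from an in-memory buffer and queries an in-memory SQLite database. Both are stand-ins chosen so the example is self-contained; the data and table name are invented, and a real file path, URL or database connection would be passed in the same places.

```python
import io
import sqlite3

import pandas as pd

# 1. From a file-like object; a local path or a URL string can be passed the same way.
csv_text = "first name,age\nPiotr,18\nTom,30\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# 2. From a database; an in-memory SQLite database stands in for a real server.
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE people (first_name TEXT, age INTEGER)')
connection.executemany(
    'INSERT INTO people VALUES (?, ?)',
    [('Ann', 25), ('Jane', 35)],
)
from_sql = pd.read_sql('SELECT * FROM people', connection)
connection.close()

print(from_csv)
print(from_sql)
```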

### Creating a DataFrame by providing data as rows

In [33]:
df = pd.DataFrame([
    ['Piotr', 18, 'M'],
    ['Tom', 30, 'M'],
    ['Ann', 25, 'F'],
    ['Jane', 35, 'F']
])
df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Piotr | 18 | M |
| 1 | Tom | 30 | M |
| 2 | Ann | 25 | F |
| 3 | Jane | 35 | F |


In [34]:
df = pd.DataFrame([
    ['Piotr', 18, 'M'],
    ['Tom', 30, 'M'],
    ['Ann', 25, 'F'],
    ['Jane', 35, 'F']
], columns=['First name', 'Age', 'Gender'])
df

|   | First name | Age | Gender |
|---|---|---|---|
| 0 | Piotr | 18 | M |
| 1 | Tom | 30 | M |
| 2 | Ann | 25 | F |
| 3 | Jane | 35 | F |


In [35]:
df = pd.DataFrame([
    ['Piotr', 18, 'M'],
    ['Tom', 30, 'M'],
    ['Ann', 25, 'F'],
    ['Jane', 35, 'F']
], columns=['First name', 'Age', 'Gender'], index=[100, 110, 120, 130])
df

|   | First name | Age | Gender |
|---|---|---|---|
| 100 | Piotr | 18 | M |
| 110 | Tom | 30 | M |
| 120 | Ann | 25 | F |
| 130 | Jane | 35 | F |


In [36]:
type(df)

pandas.core.frame.DataFrame

In [37]:
df.columns

Index(['First name', 'Age', 'Gender'], dtype='object')

In [38]:
df.index

Int64Index([100, 110, 120, 130], dtype='int64')

In [39]:
len(df)

4

In [42]:
df.size

12
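
`len(df)` counts rows, while `df.size` counts all cells. Both follow from `df.shape`, as this short sketch on the same data shows (the `rows` and `columns` names are illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    ['Piotr', 18, 'M'],
    ['Tom', 30, 'M'],
    ['Ann', 25, 'F'],
    ['Jane', 35, 'F']
], columns=['First name', 'Age', 'Gender'], index=[100, 110, 120, 130])

rows, columns = df.shape          # (number of rows, number of columns)
print(df.shape)                   # (4, 3)
print(len(df) == rows)            # True
print(df.size == rows * columns)  # True
```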

In [43]:
df

|   | First name | Age | Gender |
|---|---|---|---|
| 100 | Piotr | 18 | M |
| 110 | Tom | 30 | M |
| 120 | Ann | 25 | F |
| 130 | Jane | 35 | F |


### Creating a DataFrame by providing data as columns

In [47]:
df = pd.DataFrame({
    'first name': ['Piotr', 'Tom', 'Ann', 'Jane'],
    'age': [18, 30, 25, 35],
    'gender': ['m', 'm', 'f', 'f']
}, index=[100, 200, 300, 400])
df

|   | first name | age | gender |
|---|---|---|---|
| 100 | Piotr | 18 | m |
| 200 | Tom | 30 | m |
| 300 | Ann | 25 | f |
| 400 | Jane | 35 | f |


In [48]:
df.sort_values(by=['first name'])

|   | first name | age | gender |
|---|---|---|---|
| 300 | Ann | 25 | f |
| 400 | Jane | 35 | f |
| 100 | Piotr | 18 | m |
| 200 | Tom | 30 | m |
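
Note that the index labels travel with their rows. `sort_values` also returns a new DataFrame and leaves the original untouched. A minimal sketch of both points, plus descending and multi-column sorting (the result variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'first name': ['Piotr', 'Tom', 'Ann', 'Jane'],
    'age': [18, 30, 25, 35],
    'gender': ['m', 'm', 'f', 'f']
}, index=[100, 200, 300, 400])

# Descending sort by a single column; the result is a new DataFrame.
oldest_first = df.sort_values(by='age', ascending=False)
print(list(oldest_first.index))  # [400, 200, 300, 100]
print(list(df.index))            # [100, 200, 300, 400], original order is kept

# Sorting by several columns: gender first, then age within each gender.
by_gender_then_age = df.sort_values(by=['gender', 'age'])
print(list(by_gender_then_age['first name']))  # ['Ann', 'Jane', 'Piotr', 'Tom']
```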


In [51]:
df.to_excel('people.xlsx')  # Excel size limitations: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3#ID0EDBD=Newer_versions