I decided to start with the basics course on Pandas, and I was already a bit confused by the course outline. Based on my current knowledge, it seemed a bit counterintuitive, but I might be proven wrong.

## Pandas 
Pandas builds on two Python packages:
- **NumPy** (multidimensional array objects for storing and manipulating data)
- **Matplotlib** (data visualization; pandas uses it for its plotting methods)

Pandas is widely used across the Python data science community.

Rectangular (tabular) data is the most common form for storing data for analysis. Pandas uses **DataFrame** objects to represent tabular data. Each column holds a single data type, but different columns can hold different types (for example text or numeric).
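
A quick sketch to check the "one type per column" idea. The `toy` frame and its column names are made up for illustration and are not part of the course material:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.5, 2.0, np.nan],
})

# .dtypes lists the data type of every column
print(toy.dtypes)

# The numeric column is stored as floats, missing value included
print(toy["score"].dtype)
```

The missing value does not change the type of the numeric column, because `NaN` is itself a float.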

### Transforming dataframes
#### 1. Exploring the data
- look into the data with `df.head()`, which shows the first 5 rows by default

In [2]:
import numpy as np
import pandas as pd

# hardcoded dataframe 
df = pd.DataFrame({
   'col1': ['Item0', 'Item0', 'Item1', 'Item1'],
   'col2': ['Gold', 'Bronze', 'Gold', 'Silver'],
   'col3': [1, 2, np.nan, 4]
})

df.head()

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Item0 | Gold | 1.0 |
| 1 | Item0 | Bronze | 2.0 |
| 2 | Item1 | Gold | NaN |
| 3 | Item1 | Silver | 4.0 |


- use the `df.info()` method to display the names of the columns, the data types they contain, and how many non-null values each one has (which reveals missing values)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
col1    4 non-null object
col2    4 non-null object
col3    3 non-null float64
dtypes: float64(1), object(2)
memory usage: 224.0+ bytes


- `df.shape` returns a tuple that indicates the number of rows and columns

(this is an attribute, not a method, so no `()` is needed)

In [9]:
df.shape

(4, 3)

- `df.describe()` shows summary statistics for the numerical columns (mean, standard deviation, min, max, and quartiles, where the 50% row is the median) plus the **count** of non-missing values

In [10]:
df.describe()

|  | col3 |
|---|---|
| count | 3.0 |
| mean | 2.333333 |
| std | 1.527525 |
| min | 1.0 |
| 25% | 1.5 |
| 50% | 2.0 |
| 75% | 3.0 |
| max | 4.0 |
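
To connect the summary rows to the individual methods, here is a small sketch that rebuilds the same dataframe and compares two rows of `describe()` with `median()` and `count()`. The variable names `summary` and `median` are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Item0", "Item0", "Item1", "Item1"],
    "col2": ["Gold", "Bronze", "Gold", "Silver"],
    "col3": [1, 2, np.nan, 4],
})

summary = df.describe()

# The "50%" row of describe() is the median of the non-missing values
median = df["col3"].median()
print(summary.loc["50%", "col3"], median)   # 2.0 2.0

# "count" ignores the missing value, so it is 3 rather than 4
print(summary.loc["count", "col3"], df["col3"].count())   # 3.0 3
```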


- `df.values` returns the data values as a two-dimensional NumPy array

In [12]:
df.values

array([['Item0', 'Gold', 1.0],
       ['Item0', 'Bronze', 2.0],
       ['Item1', 'Gold', nan],
       ['Item1', 'Silver', 4.0]], dtype=object)
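
As far as I can tell from the pandas documentation, `df.to_numpy()` is the recommended way to get the same array nowadays. A minimal sketch, assuming the same dataframe as above (`as_array` and `numeric_only` are my own names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Item0", "Item0", "Item1", "Item1"],
    "col2": ["Gold", "Bronze", "Gold", "Silver"],
    "col3": [1, 2, np.nan, 4],
})

# Whole dataframe: mixed text and numbers end up in one 2-D array
as_array = df.to_numpy()
print(as_array.shape)        # (4, 3)

# Selecting only the numeric column keeps a numeric dtype
numeric_only = df[["col3"]].to_numpy()
print(numeric_only.dtype)    # float64
```

Selecting the numeric columns first avoids the catch-all `object` dtype that shows up when text and numbers share one array.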

- `df.columns` and `df.index` hold the labels of the columns and of the rows (the index), respectively

In [13]:
df.columns

Index(['col1', 'col2', 'col3'], dtype='object')

In [14]:
df.index

RangeIndex(start=0, stop=4, step=1)

#### 2.A Sorting
- changing the order of the rows by sorting them
- using the `sort_values` method, passing in a column name to be sorted by
- to **reverse the order**, set the optional argument `ascending =` to `False`
- to **sort by multiple columns**, pass a list of columns to the `by =` argument
- to specify the direction of sorting for multiple columns, pass a list to the `ascending =` argument

In [17]:
df.sort_values(by = 'col3')

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Item0 | Gold | 1.0 |
| 1 | Item0 | Bronze | 2.0 |
| 3 | Item1 | Silver | 4.0 |
| 2 | Item1 | Gold | NaN |


In [18]:
df.sort_values(by = 'col3', ascending = False)

|  | col1 | col2 | col3 |
|---|---|---|---|
| 3 | Item1 | Silver | 4.0 |
| 1 | Item0 | Bronze | 2.0 |
| 0 | Item0 | Gold | 1.0 |
| 2 | Item1 | Gold | NaN |


In [23]:
df.sort_values(by = ['col2', 'col3'])

|  | col1 | col2 | col3 |
|---|---|---|---|
| 1 | Item0 | Bronze | 2.0 |
| 0 | Item0 | Gold | 1.0 |
| 2 | Item1 | Gold | NaN |
| 3 | Item1 | Silver | 4.0 |


In [25]:
df.sort_values(by = ['col2', 'col3'], ascending = [False, True])

|  | col1 | col2 | col3 |
|---|---|---|---|
| 3 | Item1 | Silver | 4.0 |
| 0 | Item0 | Gold | 1.0 |
| 2 | Item1 | Gold | NaN |
| 1 | Item0 | Bronze | 2.0 |
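
One thing I noticed in the outputs above: the row with the missing `col3` value ends up last whether the sort is ascending or descending. This seems to be controlled by the `na_position` argument of `sort_values`, which defaults to `'last'`. A small sketch, assuming the same dataframe (`nan_first` and `descending` are my own names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Item0", "Item0", "Item1", "Item1"],
    "col2": ["Gold", "Bronze", "Gold", "Silver"],
    "col3": [1, 2, np.nan, 4],
})

# Put rows with a missing col3 value at the top instead of the bottom
nan_first = df.sort_values(by="col3", na_position="first")
print(nan_first.index.tolist())    # [2, 0, 1, 3]

# Default behaviour: NaN stays last even when the order is reversed
descending = df.sort_values(by="col3", ascending=False)
print(descending.index.tolist())   # [3, 1, 0, 2]

# sort_values returns a new dataframe; the original order is untouched
print(df.index.tolist())           # [0, 1, 2, 3]
```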


#### 2.B Subsetting 
- use `df['column_name']` to subset a column
- to subset multiple columns, use an extra pair of brackets to pass a list of columns: `df[['col2', 'col1']]`
- add a logical condition to the subsetting, e.g. `df['col3'] >= 2`, which returns a Series of boolean values
- use this boolean Series inside square brackets to get the entire matching rows: `df[df['col3'] >= 2]`

In [28]:
df['col2']

0      Gold
1    Bronze
2      Gold
3    Silver
Name: col2, dtype: object

In [30]:
df[['col2', 'col1']]

|  | col2 | col1 |
|---|---|---|
| 0 | Gold | Item0 |
| 1 | Bronze | Item0 |
| 2 | Gold | Item1 |
| 3 | Silver | Item1 |


In [32]:
df['col3'] >= 2

0    False
1     True
2    False
3     True
Name: col3, dtype: bool

In [34]:
df[df['col3'] >= 2]

|  | col1 | col2 | col3 |
|---|---|---|---|
| 1 | Item0 | Bronze | 2.0 |
| 3 | Item1 | Silver | 4.0 |


In [35]:
df[df['col2'] == 'Gold']

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Item0 | Gold | 1.0 |
| 2 | Item1 | Gold | NaN |
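
The two kinds of condition above can also be combined. A minimal sketch, assuming the same dataframe: each condition goes in its own parentheses and they are joined with `&` (and) or `|` (or). The names `mask`, `gold_with_value` and `missing_col3` are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Item0", "Item0", "Item1", "Item1"],
    "col2": ["Gold", "Bronze", "Gold", "Silver"],
    "col3": [1, 2, np.nan, 4],
})

# Both conditions must hold; the parentheses are required around each one
mask = (df["col2"] == "Gold") & (df["col3"] >= 1)
gold_with_value = df[mask]
print(gold_with_value.index.tolist())   # [0]

# A comparison with NaN is False, so row 2 drops out above.
# To select the rows with a missing value, use .isna() instead.
missing_col3 = df[df["col3"].isna()]
print(missing_col3.index.tolist())      # [2]
```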


- the `.isin()` method filters for rows whose value appears in a list of values

In [42]:
# boolean Series: True where col2 is one of the listed values
list_of_values = df['col2'].isin(['Gold', 'Silver'])
df[list_of_values]

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Item0 | Gold | 1.0 |
| 2 | Item1 | Gold | NaN |
| 3 | Item1 | Silver | 4.0 |


In [43]:
# in one line
df[df['col2'].isin(['Gold', 'Silver'])]

|  | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Item0 | Gold | 1.0 |
| 2 | Item1 | Gold | NaN |
| 3 | Item1 | Silver | 4.0 |
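
The opposite filter (everything that is *not* in the list) should work by negating the boolean Series with `~`. A short sketch under that assumption, again with the same dataframe (`not_gold_or_silver` is my own name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Item0", "Item0", "Item1", "Item1"],
    "col2": ["Gold", "Bronze", "Gold", "Silver"],
    "col3": [1, 2, np.nan, 4],
})

# ~ flips True and False, keeping only rows whose col2 is NOT in the list
not_gold_or_silver = df[~df["col2"].isin(["Gold", "Silver"])]
print(not_gold_or_silver["col2"].tolist())   # ['Bronze']
```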


#### 3. Creating new columns 
- this is often referred to as mutating or transforming the dataframe
- e.g. by deriving the new column from an existing one: `df['new_column'] = df['existing_column'] / 100`
- or by filling it with a single value or NaN: `df['new_column'] = np.nan`


In [3]:
df['col_new'] = df['col3'] * 4
df


|  | col1 | col2 | col3 | col_new |
|---|---|---|---|---|
| 0 | Item0 | Gold | 1.0 | 4.0 |
| 1 | Item0 | Bronze | 2.0 | 8.0 |
| 2 | Item1 | Gold | NaN | NaN |
| 3 | Item1 | Silver | 4.0 | 16.0 |


In [5]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
df['col_new_2'] = np.nan
df

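|  | col1 | col2 | col3 | col_new | col_new_2 |
|---|---|---|---|---|---|
| 0 | Item0 | Gold | 1.0 | 4.0 | NaN |
| 1 | Item0 | Bronze | 2.0 | 8.0 | NaN |
| 2 | Item1 | Gold | NaN | NaN | NaN |
| 3 | Item1 | Silver | 4.0 | 16.0 | NaN |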
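
Assigning with `df['col_new'] = ...` changes `df` in place. If the original should stay untouched, `DataFrame.assign` looks like the alternative, since it returns a new dataframe. A minimal sketch, assuming the original three-column dataframe (`df_with_new` is my own name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Item0", "Item0", "Item1", "Item1"],
    "col2": ["Gold", "Bronze", "Gold", "Silver"],
    "col3": [1, 2, np.nan, 4],
})

# assign() returns a copy with the extra column; df itself is not modified
df_with_new = df.assign(col_new=df["col3"] * 4)

print(df_with_new.columns.tolist())   # ['col1', 'col2', 'col3', 'col_new']
print(df.columns.tolist())            # ['col1', 'col2', 'col3']
```

The missing value in `col3` carries over: any arithmetic with `NaN` gives `NaN`, which is why row 2 of `col_new` is empty as well.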
