In [3]:
import pandas as pd

## 1. df.loc[]:
### Access a group of rows and columns ***by label(s) or a boolean array***.

In [4]:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],

     index=['cobra', 'viper', 'sidewinder'],

     columns=['max_speed', 'shield'])

In [6]:
### Get by row
df.loc['viper']

max_speed    4
shield       5
Name: viper, dtype: int64

In [7]:
### Get multiple rows
df.loc[['viper', 'sidewinder']]

| | max_speed | shield |
|---|---|---|
| viper | 4 | 5 |
| sidewinder | 7 | 8 |


In [8]:
### Get by row and column
df.loc['cobra', 'shield']

2



### Values can be set with the same indexers that are used to read them.
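A minimal sketch of setting values through `.loc`, assuming a fresh copy of the frame from above. The name `ships` and the replacement values are made up for illustration.

```python
import pandas as pd

# Same data as the example above, under a hypothetical name
ships = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                     index=['cobra', 'viper', 'sidewinder'],
                     columns=['max_speed', 'shield'])

# Set a single cell by row label and column label
ships.loc['cobra', 'shield'] = 20

# Set one column for several rows at once
ships.loc[['viper', 'sidewinder'], 'max_speed'] = 0

# Set values for the rows selected by a boolean mask
ships.loc[ships['shield'] > 6, 'shield'] = 99

print(ships)
```

Each assignment modifies `ships` directly, so nothing needs to be assigned back.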

### See more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html




## 2. df.iloc[]: Purely integer-location based indexing for ***selection by position***.


In [9]:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},

          {'a': 100, 'b': 200, 'c': 300, 'd': 400},

          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000 }]

df = pd.DataFrame(mydict)

df

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 100 | 200 | 300 | 400 |
| 2 | 1000 | 2000 | 3000 | 4000 |


In [10]:
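# Select a row by integer position; a single integer returns a Series
type(df.iloc[0])  # pandas.core.series.Series (not displayed: only the last expression is shown)
df.iloc[0]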

a    1
b    2
c    3
d    4
Name: 0, dtype: int64

In [11]:
# A list of positions returns a DataFrame, even for a single row
df.iloc[[0]]

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |


In [12]:
type(df.iloc[[0]])

pandas.core.frame.DataFrame

In [13]:
df.iloc[[0, 1]]

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 100 | 200 | 300 | 400 |


In [14]:
df.iloc[:3]

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 100 | 200 | 300 | 400 |
| 2 | 1000 | 2000 | 3000 | 4000 |


In [17]:
# Get value by bool array (get row)
df.iloc[[True, False, True]]

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 2 | 1000 | 2000 | 3000 | 4000 |


In [16]:
df.iloc[lambda x: x.index % 2 == 0]

| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 2 | 1000 | 2000 | 3000 | 4000 |


### Indexing with ***both axes***


In [18]:
df.iloc[0, 1]

2

In [19]:
df.iloc[[0, 2], [1, 3]]

| | b | d |
|---|---|---|
| 0 | 2 | 4 |
| 2 | 2000 | 4000 |


In [20]:
df.iloc[1:3, 0:3]

| | a | b | c |
|---|---|---|---|
| 1 | 100 | 200 | 300 |
| 2 | 1000 | 2000 | 3000 |


In [21]:
df.iloc[:, [True, False, True, False]]

| | a | c |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 100 | 300 |
| 2 | 1000 | 3000 |


In [22]:
df.iloc[:, lambda df: [0, 2]]

| | a | c |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 100 | 300 |
| 2 | 1000 | 3000 |


### See more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html

## 3. python slice() 

```
a[start:stop:step] is equivalent to a[slice(start, stop, step)]

```
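The equivalence above can be checked on a plain list, and the same `slice` objects can be passed to `.iloc`. This is a small sketch; the names `letters`, `window`, `rows`, and `cols` are illustrative only.

```python
import pandas as pd

letters = list("abcdef")

# A slice object carries the same start/stop/step as the colon syntax
window = slice(1, 5, 2)
print(letters[1:5:2], letters[window])

# slice objects also work with .iloc, which makes it easy to build
# the row and column ranges programmatically
df = pd.DataFrame([{'a': 1, 'b': 2, 'c': 3, 'd': 4},
                   {'a': 100, 'b': 200, 'c': 300, 'd': 400},
                   {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}])
rows = slice(1, 3)
cols = slice(0, 3)
sub = df.iloc[rows, cols]  # same selection as df.iloc[1:3, 0:3]
print(sub)
```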


## 4. Because pandas is column-major, if you want to do multiple slicing operations, always do the column-based slicing operations first.

For example, if you want to get the review from the first row of the data, there are two slicing operations:

- get row (row-based operation)
- get review (column-based operation)

### ***Get row -> get review is 25x slower than get review -> get row.***
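One way to check the ordering effect yourself is a small timing sketch like the one below. The data is a made-up stand-in (the `Company`, `Review`, and `Upvotes` values are hypothetical), and the measured ratio will vary with the data, the machine, and the pandas version, so no particular number should be expected.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the reviews data, with mixed column dtypes
n = 10_000
df = pd.DataFrame({
    "Company": ["Acme"] * n,
    "Review": ["ok"] * n,
    "Upvotes": np.arange(n),
})

def row_first():
    # Row-based operation first: builds a Series for the whole row
    return df.iloc[0]["Review"]

def col_first():
    # Column-based operation first: picks the column, then one position
    return df["Review"].iloc[0]

t_row = timeit.timeit(row_first, number=200)
t_col = timeit.timeit(col_first, number=200)
print(f"row -> column: {t_row:.4f}s, column -> row: {t_col:.4f}s")
```

Both functions return the same value; only the order of the two slicing operations differs.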




## 5. Setting index for DataFrame

By default, pandas adds an increasing integer index to a DataFrame. To use different labels, call the `df.set_index()` method. The index can then serve many purposes, such as grouping.

```
def company_type(x):
    # Label a row by whether its "Company" is a known hardware company.
    hardware_companies = {"Orange", "Dell", "IBM", "Siemens"}
    return "Hardware" if x["Company"] in hardware_companies else "Software"

# Assumes df has a "Company" column; apply the function to each row.
df["Type"] = df.apply(company_type, axis=1)

# Use the "Type" column as the row labels (index).
df = df.set_index("Type")
df
```
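A self-contained sketch of what the new index allows, assuming a tiny made-up frame (the `jobs` name, the "Acme" company, and the `Upvotes` numbers are hypothetical). It shows a label lookup and a group-by on the index level.

```python
import pandas as pd

# Hypothetical data standing in for the reviews DataFrame
jobs = pd.DataFrame({
    "Company": ["Orange", "Dell", "Acme", "IBM"],
    "Upvotes": [3, 5, 2, 7],
})

hardware_companies = {"Orange", "Dell", "IBM", "Siemens"}
jobs["Type"] = jobs["Company"].map(
    lambda c: "Hardware" if c in hardware_companies else "Software"
)

# "Type" becomes the row labels; labels do not have to be unique
by_type = jobs.set_index("Type")

# Label-based lookup returns every row carrying that label
hardware_rows = by_type.loc["Hardware"]

# The index level can be used directly for grouping
upvotes_by_type = by_type.groupby(level="Type")["Upvotes"].sum()
print(hardware_rows)
print(upvotes_by_type)
```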


## 6. Common pitfalls

- pandas is great for most day-to-day data analysis. It's instrumental to my job and I'm grateful that the entire pandas community is actively developing it. However, I think some of pandas' design decisions are a bit questionable.

- Some of the common pandas pitfalls:

### 6.1 NaNs

- NaNs are stored as floats in pandas, so when an operation fails because of a NaN, the error doesn't mention the NaN. Instead, it says that the operation isn't supported for floats.
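A short sketch of this failure mode, assuming a made-up Series of review texts with one missing value. Calling `len` on each element fails on the NaN, and the error message talks about floats.

```python
import numpy as np
import pandas as pd

# Hypothetical review texts with one missing entry
reviews = pd.Series(["great", np.nan, "slow"], dtype=object)

message = None
try:
    reviews.apply(len)
except TypeError as exc:
    # The message refers to 'float', not to NaN
    message = str(exc)
    print(message)

# Dropping (or filling) the missing values first avoids the error
lengths = reviews.dropna().apply(len)
print(lengths.tolist())
```

Checking `reviews.isna().sum()` before applying element-wise functions is a cheap way to spot this early.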

### 6.2 Changes not Inplace

- Most pandas operations aren't in place by default, so if you make changes to your DataFrame, you need to assign the result back to it. You can make changes in place by passing the argument `inplace=True`.
- In [39]:

### "Process" column is still in df
df.drop(columns=["Process"])
df.columns

- Out[39]:

Index(['Company', 'Title', 'Job', 'Level', 'Date', 'Upvotes', 'Offer',
       'Experience', 'Difficulty', 'Review', 'Process'],
      dtype='object')

- In [40]:

### To make changes to df, set `inplace=True`
df.drop(columns=["Process"], inplace=True)
df.columns
### This is equivalent to
### df = df.drop(columns=["Process"])

- Out[40]:

Index(['Company', 'Title', 'Job', 'Level', 'Date', 'Upvotes', 'Offer',
       'Experience', 'Difficulty', 'Review'],
      dtype='object')

### 6.3 Performance issues with very large datasets
### 6.4 Reproducibility issues

- Especially with dumping and loading DataFrames to/from files. There are two main causes:

    - Problem with labels (see the section about labels above).
    - Weird rounding issues for floats.
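As an illustration of the labels problem, the sketch below writes a small made-up frame to CSV text and reads it back. The frame, its labels, and the `scores` name are assumptions for the example. `index_col` and `float_precision` are existing `pd.read_csv` parameters; whether you need `float_precision="round_trip"` depends on your data.

```python
import io

import pandas as pd

# Hypothetical frame with string labels and a float that has a long repr
scores = pd.DataFrame({"score": [0.1 + 0.2, 1.5]}, index=["cobra", "viper"])

csv_text = scores.to_csv()

# Naive reload: the labels come back as an ordinary, unnamed column
naive = pd.read_csv(io.StringIO(csv_text))
print(list(naive.columns))

# Telling read_csv which column holds the labels restores the index;
# float_precision="round_trip" asks for a parser that round-trips floats
restored = pd.read_csv(io.StringIO(csv_text), index_col=0,
                       float_precision="round_trip")
print(restored)
```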

### 6.5 Not GPU compatible

- pandas can't take advantage of GPUs, so if your computations are on GPUs and your feature engineering is on CPUs, moving data from CPUs to GPUs can become a time bottleneck. If you want something like pandas that works on GPUs, check out RAPIDS cuDF. dask and modin are also worth a look, though they mainly parallelize pandas-style workloads across CPU cores or machines.


## Reference

https://github.com/chiphuyen/just-pandas-things/blob/master/just-pandas-things.ipynb