In [3]:
import pandas as pd
import numpy as np

## Creating example DataFrames

In [3]:
## specify column names and values with a dictionary
pd.DataFrame({'col_one':[10,20], 'col_two':[30, 40]})

| | col_one | col_two |
|---|---|---|
| 0 | 10 | 30 |
| 1 | 20 | 40 |


In [5]:
# fill dataframe with random values
pd.DataFrame(np.random.rand(2, 3), columns = list('abc'))

| | a | b | c |
|---|---|---|---|
| 0 | 0.214881 | 0.402507 | 0.113398 |
| 1 | 0.447643 | 0.186155 | 0.193834 |
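
`np.random.rand` draws new values on every run, so the numbers above will differ each time. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 0 (neither is used in the original cell):

```python
import numpy as np
import pandas as pd

# the seed value is an arbitrary choice for illustration
rng = np.random.default_rng(0)

# same shape and column names as the cell above, but repeatable across runs
df_random = pd.DataFrame(rng.random((2, 3)), columns=list('abc'))
df_random
```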


In [6]:
# create a DataFrame with mixed data types
# (pandas.util.testing is deprecated since pandas 1.0 and may be missing in newer releases)
pd.util.testing.makeMixedDataFrame()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | foo1 | 2009-01-01 |
| 1 | 1.0 | 1.0 | foo2 | 2009-01-02 |
| 2 | 2.0 | 0.0 | foo3 | 2009-01-05 |
| 3 | 3.0 | 1.0 | foo4 | 2009-01-06 |
| 4 | 4.0 | 0.0 | foo5 | 2009-01-07 |
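
If the testing helper is not available in your pandas version, the same frame can be built by hand. This is only a sketch that reproduces the output shown above with public API calls; the variable name `df_mixed` is made up for illustration.

```python
import pandas as pd

# hand-built copy of the output above, no testing helper required
df_mixed = pd.DataFrame({
    'A': [0.0, 1.0, 2.0, 3.0, 4.0],
    'B': [0.0, 1.0, 0.0, 1.0, 0.0],
    'C': ['foo1', 'foo2', 'foo3', 'foo4', 'foo5'],
    # business days only, which is why the weekend of Jan 3-4 is skipped
    'D': pd.bdate_range('2009-01-01', periods=5),
})
df_mixed.dtypes
```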


## How to deal with data trapped in a Series of lists

In [4]:
df = pd.DataFrame({'col_one':[1,2,3], 'col_two':[[10, 40], [20, 50], [30, 60]]})
df

| | col_one | col_two |
|---|---|---|
| 0 | 1 | [10, 40] |
| 1 | 2 | [20, 50] |
| 2 | 3 | [30, 60] |


### Problem: Data in `col_two` is trapped in a Series of lists

### Solution: Expand the Series into a DataFrame

In [5]:
df.col_two.apply(pd.Series)

| | 0 | 1 |
|---|---|---|
| 0 | 10 | 40 |
| 1 | 20 | 50 |
| 2 | 30 | 60 |
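
The expanded values usually need to go back next to the original columns. A possible sketch, assuming every list has the same length; it builds the new columns from `tolist()` instead of `apply(pd.Series)`, and the names `new1` and `new2` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'col_one': [1, 2, 3],
                   'col_two': [[10, 40], [20, 50], [30, 60]]})

# build the new columns directly from the lists, keeping the original index
expanded = pd.DataFrame(df['col_two'].tolist(), index=df.index,
                        columns=['new1', 'new2'])  # hypothetical column names

# attach them to the original frame and drop the list column
df_flat = pd.concat([df.drop(columns='col_two'), expanded], axis='columns')
df_flat
```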


## Need to check if two Series contain the same elements?

#### Don't do this: `df.A == df.B`

#### Do this: `df.A.equals(df.B)`

#### It works for DataFrames too: `df.equals(df2)`


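`equals()` returns a single boolean and treats NaNs in the same position as equal, whereas `==` compares element by element and evaluates `NaN == NaN` as `False`.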
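
A small illustration of the difference, using a made-up frame whose two columns are identical and each contain one NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [1.0, np.nan, 3.0]})

elementwise = df.A == df.B      # boolean Series; the NaN row compares as False
all_match = elementwise.all()   # False, even though the columns are identical
same = df.A.equals(df.B)        # True: NaNs in the same position count as equal
```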

***

- Calculate the fraction of missing values in each column (multiply by 100 for a percentage):  
`df.isna().mean()`

- Drop columns with any missing values:  
`df.dropna(axis='columns')`

- Drop columns in which more than 10% of values are missing:  
`df.dropna(thresh=len(df)*0.9, axis='columns')`
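
A quick check of the three one-liners on toy data. The frame below is made up for illustration, and `thresh` is wrapped in `int()` here because it is documented as an integer count of non-missing values.

```python
import numpy as np
import pandas as pd

# toy data: 'a' is complete, 'b' has 1 of 10 missing, 'c' has 5 of 10 missing
df = pd.DataFrame({
    'a': range(10),
    'b': [np.nan] + list(range(9)),
    'c': [np.nan] * 5 + list(range(5)),
})

missing_share = df.isna().mean()          # a: 0.0, b: 0.1, c: 0.5
no_missing = df.dropna(axis='columns')    # keeps only 'a'

# thresh is the minimum number of non-missing values a column needs to be kept
mostly_full = df.dropna(thresh=int(len(df) * 0.9), axis='columns')  # keeps 'a' and 'b'
```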


## Are your dataset rows spread across multiple files, but you need a single DataFrame?

Solution:  
1. Use glob() to list your files
2. Use a generator expression to read files and concat() to combine them

![](glob.jpg)
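
The two steps above can be sketched as follows. The file names and columns are invented, and the files are first written to a temporary folder only so that the example runs on its own; with real data you would skip that part and point `glob()` at your own path pattern.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

# create two small CSV files so the sketch is runnable (made-up names and data)
tmp_dir = Path(tempfile.mkdtemp())
pd.DataFrame({'id': [1, 2], 'value': [10, 20]}).to_csv(tmp_dir / 'data_1.csv', index=False)
pd.DataFrame({'id': [3, 4], 'value': [30, 40]}).to_csv(tmp_dir / 'data_2.csv', index=False)

# 1. list the files (sorted, because glob() does not guarantee an order)
files = sorted(glob(str(tmp_dir / 'data_*.csv')))

# 2. read each file with a generator expression and stack the rows
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.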