## Dealing with DataFrames
Here we go through common read_csv options for loading a CSV file into a pandas DataFrame: header, skiprows, names, nrows, and na_values for messy data.

In [1]:
import pandas as pd
df = pd.read_csv('CSV Files/stock_data.csv')
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata


In [14]:
# Sometimes a CSV file has extra lines above the real header. To skip them we can use 'skiprows' or 'header':
#df1 = pd.read_csv('CSV Files/stock_data.csv', skiprows = 2)
# Or
df1 = pd.read_csv('CSV Files/stock_data.csv', header = 2)
df1
# skiprows = n skips the first n lines; header = n uses line n (counting from 0) as the header and ignores the lines above it.
# This file has no extra lines at the top, so here the WMT row ends up as the header.

Unnamed: 0,WMT,4.61,484,65,n.a.
0,MSFT,-1,85,64,bill gates
1,RIL,not available,50,1023,mukesh ambani
2,TATA,5.6,-1,n.a.,ratan tata
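
To make the skiprows/header equivalence concrete, here is a minimal sketch on a small in-memory CSV. The CSV text, including the extra "stock report" title line, is made up for illustration and is not the contents of stock_data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text with one extra title line above the real header.
csv_text = """stock report,,,,
tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
"""

# skiprows=1 drops the first line; the next line becomes the header.
by_skiprows = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# header=1 uses the line at index 1 as the header and ignores the line above it.
by_header = pd.read_csv(io.StringIO(csv_text), header=1)

print(by_skiprows.columns.tolist())
print(by_skiprows.equals(by_header))
```

Both calls should give the same DataFrame, so which one to use is mostly a matter of taste.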


In [15]:
# If your CSV file has no header row, pass header = None and pandas numbers the columns (this file does have one, so it shows up as row 0):
df2 = pd.read_csv('CSV Files/stock_data.csv', header = None)
df2

Unnamed: 0,0,1,2,3,4
0,tickers,eps,revenue,price,people
1,GOOGL,27.82,87,845,larry page
2,WMT,4.61,484,65,n.a.
3,MSFT,-1,85,64,bill gates
4,RIL,not available,50,1023,mukesh ambani
5,TATA,5.6,-1,n.a.,ratan tata


In [16]:
# You can then supply your own column names with 'names'.
# Because this file already has a header line, that line still appears as the first data row:
df2 = pd.read_csv('CSV Files/stock_data.csv', header = None, names=["tickers","eps", "revenue", "price", "people"])
df2

Unnamed: 0,tickers,eps,revenue,price,people
0,tickers,eps,revenue,price,people
1,GOOGL,27.82,87,845,larry page
2,WMT,4.61,484,65,n.a.
3,MSFT,-1,85,64,bill gates
4,RIL,not available,50,1023,mukesh ambani
5,TATA,5.6,-1,n.a.,ratan tata
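
When the file already has a header line, as in the output above, passing header = 0 together with names replaces that line instead of keeping it as data. A small sketch, assuming an in-memory CSV and made-up replacement names (new_names is hypothetical):

```python
import io
import pandas as pd

# Small in-memory CSV that already has a header line (illustrative data).
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
"""

# Hypothetical replacement column names.
new_names = ["ticker", "eps", "revenue", "price", "ceo"]

# header=None: the file's own header line is kept as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), header=None, names=new_names)

# header=0: the first line is treated as a header and replaced by new_names.
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names)

print(len(kept), len(replaced))
print(replaced.columns.tolist())
```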


In [22]:
# If your dataset is pretty big and you only want the first few rows, use 'nrows' and pass the number of rows to read:
df3 = pd.read_csv('CSV Files/stock_data.csv', nrows = 3)
df3

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1.0,85,64,bill gates
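
Notice that eps shows -1.0 here but -1 in the full table. A short sketch of why, assuming the CSV text below matches the file (it is reconstructed from the outputs shown in this notebook): with nrows = 3 the "not available" row is never read, so pandas can parse eps as numbers.

```python
import io
import pandas as pd
from pandas.api.types import is_numeric_dtype

# CSV text reconstructed from the notebook outputs (an assumption about the file).
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

full = pd.read_csv(io.StringIO(csv_text))
first_three = pd.read_csv(io.StringIO(csv_text), nrows=3)

# nrows counts data rows; the header line is not included in the count.
print(first_three.shape)

# eps is numeric in the 3-row read, but not in the full read,
# because the full column contains the text "not available".
print(is_numeric_dtype(first_three["eps"]), is_numeric_dtype(full["eps"]))
```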


In [32]:
# We can tell pandas which cell values mean "missing", which helps a lot when analysing messy data.
# Here every n.a. cell is read as NaN:
df4 = pd.read_csv('CSV Files/stock_data.csv', na_values = ["n.a."])
df4
# So now you can see two values are replaced.

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845.0,larry page
1,WMT,4.61,484,65.0,
2,MSFT,-1,85,64.0,bill gates
3,RIL,not available,50,1023.0,mukesh ambani
4,TATA,5.6,-1,,ratan tata


In [33]:
# Here both n.a. and not available cells are read as NaN:
df5 = pd.read_csv('CSV Files/stock_data.csv', na_values = ["n.a.", "not available"])
df5
# You can see three values are replaced.

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845.0,larry page
1,WMT,4.61,484,65.0,
2,MSFT,-1.0,85,64.0,bill gates
3,RIL,,50,1023.0,mukesh ambani
4,TATA,5.6,-1,,ratan tata
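
Instead of counting the replaced cells by eye, isna().sum() can do it. A minimal sketch, assuming the CSV text below matches the file (it is reconstructed from the outputs shown in this notebook):

```python
import io
import pandas as pd

# CSV text reconstructed from the notebook outputs (an assumption about the file).
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

one_marker = pd.read_csv(io.StringIO(csv_text), na_values=["n.a."])
two_markers = pd.read_csv(io.StringIO(csv_text), na_values=["n.a.", "not available"])

# Missing cells per column, then the total for each read.
print(two_markers.isna().sum())
print(int(one_marker.isna().sum().sum()), int(two_markers.isna().sum().sum()))
```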


In [34]:
# Looking at the dataset again, the revenue column has a -1, which is not a valid revenue.
# If we simply added -1 to the na_values list used for df5, the valid -1 in the eps column would become NaN too.
# To avoid that, pass a dictionary that defines the missing-value markers for each column separately:
df6 = pd.read_csv('CSV Files/stock_data.csv', na_values = {
    "eps": ["n.a.", "not available"],
    "revenue": [-1],
    "price": ["n.a.", "not available"],
    "people": ["n.a.", "not available"]
})
df6
# It works: the -1 in revenue became NaN, while the -1 in eps is kept.

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845.0,larry page
1,WMT,4.61,484.0,65.0,
2,MSFT,-1.0,85.0,64.0,bill gates
3,RIL,,50.0,1023.0,mukesh ambani
4,TATA,5.6,,,ratan tata
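
The difference between a single list and a per-column dictionary can be checked directly. A sketch under the same assumption as before: the CSV text is reconstructed from the notebook outputs, and the names everywhere and per_column are only for illustration.

```python
import io
import pandas as pd

# CSV text reconstructed from the notebook outputs (an assumption about the file).
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

text_markers = ["n.a.", "not available"]

# One list for all columns: "-1" is treated as missing everywhere.
everywhere = pd.read_csv(io.StringIO(csv_text), na_values=text_markers + ["-1"])

# Per-column markers: -1 is treated as missing only in revenue.
per_column = pd.read_csv(io.StringIO(csv_text), na_values={
    "eps": text_markers,
    "revenue": [-1],
    "price": text_markers,
    "people": text_markers,
})

# MSFT's eps is lost with the single list but kept with the dictionary.
print(everywhere.loc[2, "eps"], per_column.loc[2, "eps"])
# TATA's revenue is NaN in both reads.
print(everywhere.loc[4, "revenue"], per_column.loc[4, "revenue"])
```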


- That's all for the basics of dealing with DataFrames.