## 1 - Dataframe with one row before the header 

In [11]:
import pandas as pd
df = pd.read_csv('stock_data_1.csv')
df

Unnamed: 0,stocks data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,tickers,eps,revenue,price,people
1,GOOGL,27.82,87,845,larry page
2,WMT,4.61,484,65,n.a.
3,MSFT,-1,85,64,bill gates
4,RIL,not available,50,1023,mukesh ambani
5,TATA,5.6,-1,n.a.,ratan tata


The file has a title line (stocks data) on its first row, so pandas takes that line as the header and treats the real column names as the first row of values. This is incorrect.

In [12]:
df = pd.read_csv('stock_data_1.csv', skiprows = 1)
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata


When you skip the first row, the 'stocks data' line is gone and the real column names become the header.

In [13]:
df = pd.read_csv('stock_data_1.csv', header = 1) #The header is in the row at index 1 (the second line of the file)
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata
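
Both calls above give the same result. Here is a minimal sketch to check that, assuming the file has a title line followed by the real header. The inline text `csv_text` is only a stand-in for `stock_data_1.csv` (reconstructed from the outputs shown above, not the actual file), read through `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

# Stand-in for stock_data_1.csv (assumed layout: title line, header, data rows)
csv_text = """stocks data,,,,
tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
"""

# skiprows=1 drops the first line, so the next line is used as the header
df_skip = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# header=1 says the header is the line with zero-based index 1;
# the lines before it are ignored
df_header = pd.read_csv(io.StringIO(csv_text), header=1)

# Compare the two frames (values, column names and dtypes)
same = df_skip.equals(df_header)
print(list(df_skip.columns))
print(same)
```

Which one to use is mostly a matter of taste: `skiprows` describes what to throw away, `header` describes where the column names are.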


## 2 - Missing header in the Dataframe (CSV file)

In [14]:
df = pd.read_csv('stock_data_2.csv')
df

Unnamed: 0,GOOGL,27.82,87,845,larry page
0,WMT,4.61,484,65,n.a.
1,MSFT,-1,85,64,bill gates
2,RIL,not available,50,1023,mukesh ambani
3,TATA,5.6,-1,n.a.,ratan tata


In [15]:
#The file has no header, so pandas assigns default integer column names
df = pd.read_csv('stock_data_2.csv', header = None)
df

Unnamed: 0,0,1,2,3,4
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata


In [16]:
#Typing the headers' names
df = pd.read_csv('stock_data_2.csv', header = None, names = ['ticker', 'eps', 'revenue', 'price', 'people'])
df

Unnamed: 0,ticker,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata
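
To see why `header = None` matters, here is a small sketch comparing the row counts with and without it. The inline `csv_text` is a stand-in for a headerless file like `stock_data_2.csv` (an assumption, not the real file), and `cols` is just a name I picked for the list of column names.

```python
import io
import pandas as pd

# Stand-in for a CSV file without a header line
csv_text = """GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
"""

cols = ['ticker', 'eps', 'revenue', 'price', 'people']

# Default behaviour: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every line as data, names supplies the column labels
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)

print(len(df_default), len(df_named))
print(df_named.loc[0, 'ticker'])
```

Without `header = None` the GOOGL row silently becomes the column names, so the frame ends up one row short.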


## 3 - Reading a number of rows

In [18]:
df = pd.read_csv('stock_data.csv', nrows = 3)
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1.0,85,64,bill gates
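
Note that `nrows` counts data rows only; the header line is not included in the count. A minimal sketch, using an inline stand-in for `stock_data.csv` (reconstructed from the outputs in this notebook, so treat its contents as an assumption):

```python
import io
import pandas as pd

# Stand-in for stock_data.csv: one header line plus five data rows
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

# Read only the first three data rows after the header
df_head = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(len(df_head))
print(list(df_head['tickers']))
```

Because the 'not available' row is never read, the eps column in this partial frame is parsed as numeric.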


## 4 - Read some data as NaN

This is useful for messy data, where missing values are written as placeholder text such as 'not available' or 'n.a.'.

In [20]:
df = pd.read_csv('stock_data.csv', na_values = ['not available', 'n.a.'])
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845.0,larry page
1,WMT,4.61,484,65.0,
2,MSFT,-1.0,85,64.0,bill gates
3,RIL,,50,1023.0,mukesh ambani
4,TATA,5.6,-1,,ratan tata


Looking at our DataFrame, we notice a -1 value in the revenue column. Revenue cannot be negative, so this value should be treated as missing.


In [21]:
df = pd.read_csv('stock_data.csv', na_values = ['not available', 'n.a.', -1])
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845.0,larry page
1,WMT,4.61,484.0,65.0,
2,MSFT,,85.0,64.0,bill gates
3,RIL,,50.0,1023.0,mukesh ambani
4,TATA,5.6,,,ratan tata


Now the -1 in eps (earnings per share) was converted to NaN as well, but eps can legitimately be negative. To convert a value only in the columns I want (revenue, in this case), it's useful to pass a dictionary.

In [22]:
df = pd.read_csv('stock_data.csv', na_values = ['not available', 'n.a.'])
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845.0,larry page
1,WMT,4.61,484,65.0,
2,MSFT,-1.0,85,64.0,bill gates
3,RIL,,50,1023.0,mukesh ambani
4,TATA,5.6,-1,,ratan tata


In [23]:
df = pd.read_csv('stock_data.csv', na_values = {
    'eps': ['not available', 'n.a.'],
    'revenue' : ['not available', 'n.a.', -1],
    'people' : ['not available', 'n.a.']
    })
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845,larry page
1,WMT,4.61,484.0,65,
2,MSFT,-1.0,85.0,64,bill gates
3,RIL,,50.0,1023,mukesh ambani
4,TATA,5.6,,n.a.,ratan tata
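
In the output above, price still shows 'n.a.' because the dictionary has no entry for that column. A possible fix is to add a `'price'` key, sketched below. The inline `csv_text` is a stand-in for `stock_data.csv` reconstructed from the outputs in this notebook (an assumption), and `markers` is just a helper name I chose.

```python
import io
import pandas as pd

# Stand-in for stock_data.csv
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

# Text placeholders that mean "missing" in every column
markers = ['not available', 'n.a.']

df_clean = pd.read_csv(io.StringIO(csv_text), na_values={
    'eps': markers,
    'revenue': markers + [-1],   # -1 is a missing marker only for revenue
    'price': markers,            # extra key so price is cleaned as well
    'people': markers,
})

print(df_clean)
```

With this dictionary the -1 in eps is kept as a real value, while the -1 in revenue and the 'n.a.' in price both become NaN.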


## 5 - Creating a new csv file

In [25]:
#Writing the cleaned dataframe to another csv file
#This command will create a new file in your directory
df.to_csv('new.csv', index = False)
#By default the index is written too, but I don't want it

In [26]:
#Checking the column names before selecting the ones I want to save in the new file
df.columns

Index(['tickers', 'eps', 'revenue', 'price', 'people'], dtype='object')

In [27]:
df.to_csv('new.csv', columns = ['tickers', 'eps'])

In [28]:
#Write the file without the header row
df.to_csv('new.csv', header = False) 
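
To compare these options without touching the disk, `to_csv` can be called with no path, in which case it returns the CSV text as a string. A minimal sketch with a small made-up frame (the frame `df_small` and its values are only for illustration):

```python
import pandas as pd

# Small illustrative frame
df_small = pd.DataFrame({
    'tickers': ['GOOGL', 'WMT'],
    'eps': [27.82, 4.61],
    'revenue': [87, 484],
})

# With no path, to_csv returns the CSV content as a string
full = df_small.to_csv(index=False)                                # all columns, no index
subset = df_small.to_csv(index=False, columns=['tickers', 'eps'])  # only two columns
no_header = df_small.to_csv(index=False, header=False)             # data rows only
with_index = df_small.to_csv()                                     # default: index is written

print(full)
print(subset)
print(no_header)
print(with_index)
```

Keep in mind that each `to_csv('new.csv', ...)` call above overwrites the same file, and that the last two calls write the index because `index = False` is not passed.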