# Basics of Exploratory Analysis

Importing the packages

In [1]:
import pandas as pd

Loading the data from the CSV file into a pandas DataFrame

In [2]:
cereal_df = pd.read_csv('cereal.csv')

In [3]:
cereal_df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,.,10.0,no info,6,280,25,3,1,0.33,68.402973
1,100% Natural Bran,Q,no info,120,3,5,15,2.0,8,8,135,0,3,1,1.0,33.983679
2,All-Bran,no info,C,70,4,1,260,9.0,7,5,320,25,.,1,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,no info,140,14.0,8,0,330,25,3,1,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14,8,-1,25,3,no info,0.75,34.384843


String columns in a dataframe have the object data type by default.

In [4]:
print('Protein column is :', cereal_df['protein'].dtype)

print('Name column is :', cereal_df['name'].dtype)
print('Rating column is :', cereal_df['rating'].dtype)
print('Type column is :', cereal_df['type'].dtype)

Protein column is : int64
Name column is : object
Rating column is : object
Type column is : object


**Missing values in the dataset are encoded as 'no info' or '.', both of which are string values. This is not ideal for numeric columns, because a single string value forces the whole column to the object data type.**


In [5]:
print(cereal_df['weight'].dtype)
print(cereal_df['sodium'].dtype)

object
object


The **fat** column appears to contain integers, yet its data type is object. Why?

Because there are missing values encoded as strings.
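One way to confirm this is to list the entries of a column that cannot be parsed as numbers. The sketch below uses a small in-memory sample that mimics the rows shown above rather than the full cereal.csv; the helper name `non_numeric_entries` is a hypothetical one introduced only for this illustration.

```python
import io
import pandas as pd

# Small in-memory sample that mimics the rows shown above (not the full cereal.csv).
sample_csv = io.StringIO(
    "name,fat,sodium\n"
    "100% Bran,1,.\n"
    "100% Natural Bran,5,15\n"
    "All-Bran,1,260\n"
    "All-Bran with Extra Fiber,no info,140\n"
)
sample_df = pd.read_csv(sample_csv)


def non_numeric_entries(column):
    """Return the entries of a column that cannot be parsed as numbers."""
    as_numbers = pd.to_numeric(column, errors="coerce")
    # Keep entries that were present originally but became NaN after coercion.
    return column[as_numbers.isna() & column.notna()]


bad_fat = non_numeric_entries(sample_df["fat"]).tolist()
bad_sodium = non_numeric_entries(sample_df["sodium"]).tolist()
print(bad_fat)
print(bad_sodium)
```

Running the same helper on the real dataframe should surface whichever string markers are hiding in a column.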

The data type of the **fiber** column is float:

In [8]:
cereal_df['fiber'].dtypes

dtype('float64')

The data type of the **calories** column is int:

In [7]:
cereal_df['calories'].dtypes

dtype('int64')

**Conclusion:**

**1) The read_csv() function reads the first row of a CSV file as the header.**

**2) It infers the data types of your columns quite well, as long as missing values are not encoded as strings.**
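Both points can be checked on a tiny in-memory file. This is a minimal sketch with made-up sample rows in the style of the cereal data; it assumes nothing beyond pandas and the standard library.

```python
import io
import pandas as pd

# Two clean rows: no string-encoded missing values anywhere.
csv_text = (
    "name,calories,fiber\n"
    "100% Bran,70,10.0\n"
    "All-Bran,70,9.0\n"
)
inferred_df = pd.read_csv(io.StringIO(csv_text))

# The first line of the file became the column names, not a data row.
column_names = list(inferred_df.columns)

# Whole numbers are inferred as integers, decimals as floats.
calories_dtype = inferred_df["calories"].dtype
fiber_dtype = inferred_df["fiber"].dtype

print(column_names)
print(calories_dtype, fiber_dtype)
```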

### Dealing with missing values and incorrect data types

In pandas, columns with a string value are stored as type object by default. Because missing values in this dataset appear to be encoded as either 'no info' or '.', both string values, any numeric column that contains one of them is read as object instead of a numeric type.

In [6]:
print(cereal_df['fat'].dtypes)

object


When a column's data type is object, even simple arithmetic gives unexpected results. This behavior is a problem for many tasks, such as visualizing distributions, finding outliers and training models, because you expect Python to treat numbers as numbers.

In [7]:
print('First row value: ',cereal_df['fat'][0])
print('Second row value: ',cereal_df['fat'][1])

First row value:  1
Second row value:  5


In [8]:
# Adding the two values concatenates them, because they are stored as strings rather than numbers
cereal_df['fat'][0] + cereal_df['fat'][1]

'15'

**If you find the above result fine, you need to check your math ASAP.**

Ideally, the fat column should be treated as type int64 or float64, and missing data should be encoded as NaN.
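For a dataframe that is already loaded, one possible fix is to replace the markers with NaN and then convert the column to numbers. The sketch below works on a hypothetical stand-in for the fat column (`fat_raw`), not on the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'fat' column as it looks after a plain read_csv.
fat_raw = pd.Series(["1", "5", "1", "no info", "2"], name="fat")

# Replace the known missing-value markers with NaN, then convert to numbers.
fat_clean = pd.to_numeric(fat_raw.replace(["no info", "."], np.nan))

# Arithmetic now adds numbers instead of concatenating strings.
fat_sum = fat_clean[0] + fat_clean[1]
print(fat_clean.dtype)
print(fat_sum)
```

The column becomes float rather than int because NaN is itself a floating-point value.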

Instead of parsing through each column and replacing 'no info' and '.' with NaN values after the dataset is loaded, you can use the na_values argument to account for those before it's loaded:

In [9]:
cereal_df2 = pd.read_csv('cereal.csv', na_values=['no info', '.'])
cereal_df2.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1.0,,10.0,,6,280.0,25,3.0,1.0,0.33,68.402973
1,100% Natural Bran,Q,,120,3,5.0,15.0,2.0,8.0,8,135.0,0,3.0,1.0,1.0,33.983679
2,All-Bran,,C,70,4,1.0,260.0,9.0,7.0,5,320.0,25,,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,,140.0,14.0,8.0,0,330.0,25,3.0,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2.0,200.0,1.0,14.0,8,-1.0,25,3.0,,0.75,34.384843


In [10]:
type(cereal_df2['shelf'][2])

numpy.float64

In [12]:
cereal_df2['fat'][0] + cereal_df2['fat'][1]

6.0

Now the arithmetic is correct.
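Once the markers are read as NaN, the missing values can also be counted per column. A minimal sketch on an in-memory sample that mimics a few of the columns above (not the full cereal.csv):

```python
import io
import pandas as pd

# In-memory sample with one string-encoded missing value per numeric column.
sample_csv = io.StringIO(
    "name,fat,sodium,shelf\n"
    "100% Bran,1,.,3\n"
    "100% Natural Bran,5,15,3\n"
    "All-Bran,1,260,.\n"
    "All-Bran with Extra Fiber,no info,140,3\n"
)
sample_df2 = pd.read_csv(sample_csv, na_values=["no info", "."])

# isna() marks NaN cells; summing the booleans counts them per column.
missing_per_column = sample_df2.isna().sum()
print(missing_per_column)
```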

## Some useful read_csv options:

**1) encoding**

Sometimes a CSV file is saved in an encoding other than the one read_csv expects by default ('utf-8'). Specifying the encoding of the file being read helps pandas read it correctly.

Commonly used encodings are 'utf-8' and 'latin-1'.
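As a small illustration, the sketch below builds a hypothetical latin-1 encoded file in memory and reads it back with the matching encoding argument. The file contents are made up for the example.

```python
import io
import pandas as pd

# Hypothetical file contents; "\u00fc" is the letter u with an umlaut (non-ASCII).
raw_bytes = "name,calories\nM\u00fcsli,110\n".encode("latin-1")

# Tell read_csv which encoding the bytes were written in.
latin_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

first_name = latin_df["name"][0]
print(first_name)
```

With a real file, the same argument is passed alongside the file path.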

**2) Setting up header**

Sometimes the read_csv function does not recognize the header as a header and reads it as a data row instead, so the first row of the dataframe contains the header values. This can happen, for example, when the file has an extra line above the header.

To correct it, we use the **skiprows = 1** argument, which skips the specified number of lines at the start of the file before the data is parsed.
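The effect can be sketched with an in-memory file that has one extra line above the real header. The preamble line here is hypothetical and only stands in for whatever precedes the header in a real file.

```python
import io
import pandas as pd

csv_text = (
    "Cereal export,2020\n"   # hypothetical preamble line above the real header
    "name,calories\n"
    "All-Bran,70\n"
    "Almond Delight,110\n"
)

# Default behaviour: the preamble becomes the header and the real header is a data row.
without_skip = pd.read_csv(io.StringIO(csv_text))

# Skipping one line lets read_csv pick up the real header.
with_skip = pd.read_csv(io.StringIO(csv_text), skiprows=1)

print(list(without_skip.columns))
print(list(with_skip.columns))
```

Note that in the version without skiprows, the calories column also ends up as strings, because the header text is mixed in with the numbers.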

**3) Specifying the separator**

Sometimes files are not CSV but TSV, i.e. tab-separated, or separated by some other character. For read_csv to read such a file correctly, we can manually specify the character that separates the columns.

This is done with the **sep** argument, for example **sep = ';'**. It tells read_csv that the fields are separated by ';', so the text between two ';' characters is treated as one column value.
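A minimal sketch of the sep argument on two in-memory samples, one separated by ';' and one by tabs. The sample rows are made up for the illustration.

```python
import io
import pandas as pd

# The same made-up row, once with ';' and once with a tab as the separator.
semicolon_text = "name;calories\nAll-Bran;70\n"
tab_text = "name\tcalories\nAll-Bran\t70\n"

semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(list(semicolon_df.columns))
print(list(tab_df.columns))
```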
