<h1>Reading and Cleaning Data</h1>
<p>In this section we discuss how to import files into Python and how to clean and modify them. A key point when reading files is that the working directory matters: a relative file name is resolved against the directory the notebook is running in.</p>
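<p>To make the working-directory point concrete, here is a minimal sketch that prints the current directory and checks whether a file is where a relative path expects it. It assumes the file is called <code>bigfoot.csv</code>, the name used in this notebook; the check itself is only an illustration, not a required step.</p>

```python
from pathlib import Path

# The directory that relative paths such as 'bigfoot.csv' are resolved against.
cwd = Path.cwd()
print(cwd)

# Build the absolute path a relative file name would resolve to
# and check whether the file is actually there before trying to read it.
csv_path = cwd / "bigfoot.csv"
print(csv_path, "exists:", csv_path.exists())
```

<p>If the file is reported as missing, either move it into that directory or pass the full path to the reading function.</p>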

In [None]:
#Importing a library

import pandas as pd

#Alias ->
#You can import a library under an alias using "as". This shortens your code and reduces the amount of typing
#you need to do.

In [None]:
#Importing datasets into Python

bigfoot = pd.read_csv('bigfoot.csv')

#%whos is an IPython magic that lists the variables currently defined
%whos

In [None]:
#Viewing dataframes with .head() and .tail()

bigfoot.head() #shows the first 5 rows by default

#bigfoot.tail(10) #pass a number to change how many rows are displayed

In [None]:
#viewing the data type of each variable in the dataframe

bigfoot.dtypes

In [None]:
#viewing the number of observations (rows) and variables (columns)

bigfoot.shape

In [None]:
#viewing the names of variables

bigfoot.columns.values

<h1>Cleaning</h1>

<h3>Why would you need to clean data?</h3>
<ul>
    <li>Data in columns and rows is not arranged correctly</li>
    <li>Missing data needs to be filled in or ignored</li>
    <li>Units are incorrect or inconsistent</li>
    <li>Order of magnitude is off</li>
    <li>Outliers skew the data</li>
    </ul>

In [None]:
#remove all observations (rows) that contain at least one missing (NA) value

bigfoot_cleaned = bigfoot.dropna()
bigfoot_cleaned.head()
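<p>Because <code>dropna()</code> with no arguments removes every row that has a missing value in any column, it can help to count the missing values first and drop on specific columns only. The sketch below uses a small, hypothetical dataframe (the values and the <code>state</code> column are made up for illustration) rather than the real file.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "latitude": [45.1, np.nan, 38.9, np.nan],
    "state": ["WA", "OR", None, "CA"],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop rows only when latitude is missing, instead of when any column is missing.
has_latitude = df.dropna(subset=["latitude"])
print(has_latitude.shape)
```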

In [None]:
%whos

In [None]:
#replace missing (NA) values with the placeholder 999 (inplace=True modifies bigfoot directly)

bigfoot.fillna(999, inplace = True)
bigfoot.head()
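<p>One thing to keep in mind with a placeholder such as 999 is that it takes part in later calculations like any other number. The following sketch, using a short hypothetical series of latitudes, compares the mean after filling with the mean pandas computes when missing values are left in place.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical latitudes with one missing value.
latitudes = pd.Series([40.0, np.nan, 50.0])

# After filling, the placeholder is averaged like a real observation.
filled = latitudes.fillna(999)
print(filled.mean())

# Left as NaN, the missing value is skipped by default.
print(latitudes.mean())
```

<p>This is why the placeholder rows are filtered out again before the variable is analysed.</p>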

In [None]:
#get value counts for a variable

bigfoot["latitude"].value_counts()

<h3>Boolean Operators</h3>
<p>Use comparison operators to filter observations based on the values of a variable.</p>
<ul style>
    <li>Equal ( == )</li>
    <li>Not equal ( != )</li>
    <li>Greater than ( > )</li>
    <li>Less than ( < )</li>
    <li>Greater than or equal ( >= )</li>
    <li>Less than or equal ( <= )</li>
    </ul>

In [None]:
#filter observations using boolean operators
#keep only the rows whose latitude is not the 999 placeholder

bigfoot = bigfoot[bigfoot["latitude"] != 999]
bigfoot["latitude"].value_counts()
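<p>Comparisons can also be combined. In pandas each condition goes in its own parentheses and the conditions are joined with <code>&amp;</code> (and) or <code>|</code> (or). A minimal sketch on hypothetical values:</p>

```python
import pandas as pd

# Hypothetical latitudes, including a 999 placeholder.
df = pd.DataFrame({"latitude": [30.0, 45.0, 60.0, 999.0]})

# Each comparison produces a True/False value per row;
# & keeps only the rows where both conditions are True.
mask = (df["latitude"] >= 40) & (df["latitude"] != 999)
filtered = df[mask]
print(filtered)
```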

Can we find the date?

<h2>Manipulating</h2>
<p>Fixing data and manipulating variables</p>

In [None]:
demo = pd.read_csv('demo.csv')
demo.columns

In [None]:
#Recoding

#let's use the value_counts() method to view a variable

demo["gender"].value_counts() # what if the values are not coded consistently?

In [None]:
#Changing the case of values

#convert to lowercase with str.lower()

#demo["gender"] = demo["gender"].str.lower()

#convert to title case (first letter capitalized) with str.title()
demo["gender"] = demo["gender"].str.title()

demo["gender"].value_counts()

In [None]:
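#recode

#.loc[] lets us select rows that meet a condition and assign new values to a variable

#str.contains() finds the values that match a pattern we give; na=False treats missing values as non-matches
#the match is case sensitive, so "M" does not match the lowercase "m" in "Female"

demo.loc[demo["gender"].str.contains("F", na=False), "gender"] = "Female"
demo.loc[demo["gender"].str.contains("M", na=False), "gender"] = "Male"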
demo["gender"].value_counts()
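<p>Another way to recode is to reduce each value to a normalized key and look it up in a dictionary with <code>map()</code>. The sketch below is an alternative under stated assumptions, not the method used above: the sample values are hypothetical, and it assumes the first letter is enough to tell the categories apart.</p>

```python
import pandas as pd

# Hypothetical, inconsistently coded values, including one missing entry.
gender = pd.Series(["f", "M", "female", "Male", None])

# Normalize: strip whitespace, lowercase, keep only the first letter.
first_letter = gender.str.strip().str.lower().str[0]

# Look up each key in the dictionary; keys not listed (and missing values) become NaN.
recoded = first_letter.map({"f": "Female", "m": "Male"})
print(recoded.value_counts(dropna=False))
```

<p>Passing <code>dropna=False</code> to <code>value_counts()</code> shows how many values could not be recoded.</p>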

In [None]:
#subset a single variable

gender = demo["gender"]
gender.head()

In [None]:
#subset multiple variables

gender_income = demo[["gender", "income"]]
gender_income

In [None]:
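#filter values

#keep the observations with income above 35, then average the numeric variables
#numeric_only=True skips text variables such as gender, which would otherwise raise an error in recent pandas versions
above_35 = demo[demo["income"] > 35]
above_35.mean(numeric_only=True)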

In [None]:
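#Sort

#sort_values returns a new, sorted dataframe; demo itself is unchanged
demo.sort_values(by="income")

#sort by several variables in descending order and show the first rows
demo.sort_values(by=['income', 'age'], ascending=False).head()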
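<p>The <code>ascending</code> argument also accepts one value per sort variable. A minimal sketch with hypothetical numbers, sorting income from high to low and breaking ties by age from low to high:</p>

```python
import pandas as pd

# Hypothetical values; two rows share the same income.
df = pd.DataFrame({"income": [50, 20, 50], "age": [30, 40, 25]})

# One True/False per variable in "by": income descending, age ascending.
ordered = df.sort_values(by=["income", "age"], ascending=[False, True])
print(ordered)

# The original dataframe keeps its order unless the result is assigned back.
print(df)
```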

In [None]:
#Pivot Table

demo.pivot_table(
    values="age", index="inccat", columns="ed", aggfunc="mean"
)
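<p>To see what the pivot table computes, here is a small sketch on made-up data. The column names <code>inccat</code>, <code>ed</code> and <code>age</code> follow the call above, but the category labels and ages are hypothetical.</p>

```python
import pandas as pd

# Hypothetical observations: income category, education level, age.
df = pd.DataFrame({
    "inccat": ["low", "low", "high", "high"],
    "ed": ["college", "college", "college", "school"],
    "age": [20, 30, 40, 50],
})

# One row per inccat, one column per ed, each cell holding the mean age of that group.
table = df.pivot_table(values="age", index="inccat", columns="ed", aggfunc="mean")
print(table)
```

<p>Combinations that never occur in the data, such as low income with school here, appear as NaN.</p>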

In [None]:
#Write the dataframe to a CSV file (the row index is written as the first column by default)

demo.to_csv("demo_from_python.csv")
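<p>By default <code>to_csv</code> also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference using a tiny hypothetical dataframe and in-memory text instead of a file; whether to pass <code>index=False</code> depends on whether the index carries information you want to keep.</p>

```python
import io
import pandas as pd

# Hypothetical one-column dataframe.
df = pd.DataFrame({"income": [10, 20]})

# Called without a file name, to_csv returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)
print(with_index)
print(without_index)

# Reading the first version back produces an extra column holding the old index.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())
```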