<h1>Reading and Cleaning Data</h1>
<p>This section covers how to import files into Python and how to clean and modify the resulting data. When reading files, remember that the directory you are working in matters: a relative file name such as bigfoot.csv is looked up in the current working directory.</p>
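
<p>As a minimal sketch of why the working directory matters, the cell below prints the current working directory and shows where a relative file name would be looked up. The file name bigfoot.csv is taken from later in this notebook; whether it exists on your machine is an assumption.</p>

```python
from pathlib import Path

# Current working directory: relative paths are resolved against it.
cwd = Path.cwd()
print(cwd)

# "bigfoot.csv" is a relative path; this is where Python would look for it.
csv_path = Path("bigfoot.csv")
resolved = cwd / csv_path
print(resolved, resolved.exists())
```

<p>If the last value printed is False, either move the file into the working directory or pass the full path to the read function.</p>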

In [None]:
%whos

In [None]:
#import libraries

import pandas as pd
import numpy as np

#Alias ->
#you can import a library under an alias using "as". This shortens your code and reduces the amount of typing
#you need to do.

<h3> Read Options</h3>
<ul><li>read_csv -- Load delimited data from a file, URL, or file-like object; use comma as default delimiter</li>
<li>read_fwf -- Read data in fixed-width column format (i.e., no delimiters)</li>
<li>read_excel -- Read tabular data from an Excel XLS or XLSX file</li>
<li>read_html -- Read all tables found in the given HTML document</li>
<li>read_json -- Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object</li>
<li>read_sas -- Read a SAS dataset stored in one of the SAS system’s custom storage formats</li>
<li>read_spss -- Read a data file created by SPSS</li>
<li>read_stata -- Read a dataset from Stata file format</li>
<li>read_xml -- Read a table of data from an XML file</li></ul>
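
<p>The read functions share a similar pattern. As a rough illustration, the cell below reads delimited text from a file-like object with read_csv and overrides the default comma delimiter. The data and column names are made up for the example and are not from any dataset used in this notebook.</p>

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited text standing in for a file on disk.
raw = "state;season;count\nWV;Fall;3\nOH;Summer;5\n"

# read_csv accepts a file-like object; sep overrides the default comma.
df = pd.read_csv(io.StringIO(raw), sep=";")
print(df)
print(df.dtypes)
```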

<h3>ICPSR Example</h3>
<ul><li><a href ="https://databases.lib.wvu.edu/connect/1360614561">ICPSR - Inter-university Consortium for Political and Social Research</a></li></ul>

In [None]:
anes_2020 = pd.read_spss("38034-0001-Data.sav")
anes_2020

In [None]:
#Importing datasets into Python

bigfoot = pd.read_csv('bigfoot.csv')
%whos

In [None]:
#Viewing dataframes with .head() and .tail()

bigfoot.head() #default is 5

#bigfoot.tail(10) #change the number displayed

In [None]:
#viewing the data type of each variable in the dataframe

bigfoot.dtypes

In [None]:
#look at basic information about the dataframe with info()

bigfoot.info()

In [None]:
#viewing the number of observations (rows) and variables (columns)

bigfoot.shape

In [None]:
#viewing the names of variables

bigfoot.columns.values

<h1>Cleaning</h1>

<h3>Why would you need to clean data?</h3>
<ul>
    <li>Data in columns and rows are not ordered in the correct way</li>
    <li>Missing data needs to be filled in or excluded</li>
    <li>Units are incorrect or inconsistent</li>
    <li>Order of magnitude is off</li>
    <li>Outliers and skewing of the data</li>
    </ul>
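
<p>Two of the problems listed above, wrong units and outliers, often show up together. The sketch below uses made-up height values (not from any dataset in this notebook) and flags values with the 1.5 x IQR rule of thumb, which is one common choice among several, before correcting the flagged unit error.</p>

```python
import pandas as pd

# Hypothetical heights: most are in centimeters, one was entered in meters.
heights = pd.Series([172.0, 165.0, 1.80, 158.0, 181.0, 169.0], name="height_cm")

# Flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb).
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
iqr = q3 - q1
outliers = (heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)
print(heights[outliers])

# Fix the unit error by converting the flagged value from meters to centimeters.
heights_fixed = heights.where(~outliers, heights * 100)
print(heights_fixed)
```

<p>In real data a flagged value should be inspected before it is changed, since an outlier is not always an error.</p>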

In [None]:
#check for missing data

bigfoot.isna()

In [None]:
#remove all observations with na

bigfoot_cleaned = bigfoot.dropna()
bigfoot_cleaned.head()

In [None]:
%whos

In [None]:
#replace na values

bigfoot.fillna(-999, inplace = True)
bigfoot.head()
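
<p>The cells above drop or fill every missing value at once. A more selective approach is sketched below on a small made-up dataframe; the column names and values are hypothetical. It counts missing values per column, drops rows based on a single column, and fills each column with a value suited to it.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of sighting data.
df = pd.DataFrame({
    "season": ["Fall", None, "Summer", "Winter"],
    "temperature": [55.0, 61.0, np.nan, np.nan],
})

# Count missing values per column instead of scanning a True/False table.
print(df.isna().sum())

# Drop rows only when a specific column is missing.
has_season = df.dropna(subset=["season"])

# Fill each column with a value that suits it, returning a new dataframe.
filled = df.fillna({"season": "Unknown", "temperature": df["temperature"].mean()})
print(filled)
```

<p>Filling with a sentinel such as -999 is easy to filter on later, but it will distort means and other statistics if it is left in a numeric column.</p>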

In [None]:
#get value counts for a variable

bigfoot["date"].value_counts()

<h3>Boolean Operators</h3>
<p>Use comparison operators to filter observations based on the values of a variable.</p>
<ul style>
    <li>Equal ( == )</li>
    <li>Not equal ( != )</li>
    <li>Greater than ( > )</li>
    <li>Less than ( < )</li>
    <li>Greater than or equal ( >= )</li>
    <li>Less than or equal ( <= )</li>
    </ul>
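
<p>A comparison on a column produces a True/False value for each row, and that result can be used to select rows. The sketch below uses made-up values to show a single condition and two combined conditions; the column names are for illustration only.</p>

```python
import pandas as pd

# Hypothetical data; the column names are for illustration only.
df = pd.DataFrame({
    "age": [22, 35, 47, 58, 63],
    "income": [28, 41, 36, 72, 30],
})

# A comparison returns a boolean Series, one True/False per row.
mask = df["income"] > 35
print(mask)

# Combine conditions with & (and), | (or), ~ (not); wrap each in parentheses.
older_high_income = df[(df["age"] >= 40) & (df["income"] > 35)]
not_middle = df[~df["age"].between(30, 60)]
print(older_high_income)
print(not_middle)
```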

In [None]:
#filter observations using boolean operators

bigfoot = bigfoot[bigfoot["date"] != -999].copy() #copy so later column assignments act on an independent dataframe
bigfoot["date"].value_counts()

In [None]:
bigfoot.dtypes

In [None]:
#convert values to datetime

bigfoot["date"] = pd.to_datetime(bigfoot["date"])
bigfoot.dtypes

In [None]:
#relabel values as categorical

bigfoot["season"] = pd.Categorical(bigfoot["season"], ordered=False)
print(bigfoot.dtypes) #only the last expression in a cell is displayed, so print this one
bigfoot["season"]
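
<p>Type conversions can fail when a column holds values that do not fit the target type. The sketch below, using made-up strings, shows one way to handle unparseable dates and how an ordered categorical differs from the unordered one used above. The category order chosen here is an assumption made for the example.</p>

```python
import pandas as pd

# Hypothetical date strings, including one that cannot be parsed.
dates = pd.Series(["2020-01-15", "2021-07-04", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")
print(parsed)

# Once the column is datetime, the .dt accessor exposes the parts.
print(parsed.dt.year)

# An ordered categorical lets seasons be compared and sorted in a chosen order.
seasons = pd.Categorical(
    ["Fall", "Winter", "Spring"],
    categories=["Winter", "Spring", "Summer", "Fall"],
    ordered=True,
)
print(seasons.min(), seasons.max())
```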

<h2>Manipulating</h2>
<p>Fixing data and manipulating variables.</p>

In [None]:
demo = pd.read_csv('demo.csv')
demo.columns

In [None]:
#Recoding

#let's use the value_counts() method to view a variable

demo["gender"].value_counts() # check whether the values are coded consistently

In [None]:
#use replace to recode variables

demo["gender"] = demo["gender"].replace(["m", "f"], ["Male", "Female"])
demo["gender"] = demo["gender"].replace(["male", "female"], ["Male", "Female"])
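
<p>When a variable has many inconsistent spellings, listing every variant in replace becomes tedious. One alternative, sketched below on made-up values, is to normalize whitespace and case first and then map the remaining variants through a dictionary.</p>

```python
import pandas as pd

# Hypothetical messy codes like those value_counts() might reveal.
gender = pd.Series(["m", "F", " male", "Female", "f", "MALE "])

# Normalize whitespace and case first, then map every variant to one label.
recode = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
cleaned = gender.str.strip().str.lower().map(recode)
print(cleaned.value_counts())
```

<p>Unlike replace, map returns a missing value for anything not found in the dictionary, which makes unexpected codes easy to spot with isna().</p>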

In [None]:
#create a Series (a single column) by subsetting

gender = demo["gender"]
gender.head()

In [None]:
#create a new dataframe with selected variables by subsetting

gender_income = demo[["gender", "income"]]
gender_income

In [None]:
#filter values

above_35 = demo[demo["income"] > 35]
above_35.mean(numeric_only=True) #skip text columns such as gender, which have no mean

In [None]:
#bin values (intervals are right-closed, e.g. (18, 25], so an age of exactly 18 falls outside the first bin)

bins = [18, 25, 35, 60, 100]

demo["aged_binned"] = pd.cut(demo["age"], bins)
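
<p>The sketch below uses the same bin edges on a few made-up ages to show how the default intervals treat the lowest edge, and how include_lowest and labels change the result. The label strings are hypothetical.</p>

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 40, 61, 100])
bins = [18, 25, 35, 60, 100]

# Default intervals are right-closed, (18, 25], so an age of exactly 18 becomes NaN.
default_bins = pd.cut(ages, bins)
print(default_bins)

# include_lowest=True closes the first interval on the left; labels name the bins.
labels = ["18-25", "26-35", "36-60", "61-100"]
labeled = pd.cut(ages, bins, labels=labels, include_lowest=True)
print(labeled.value_counts(sort=False))
```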

In [None]:
#shuffle a variable (permutation reorders all of the values, so the mean is unchanged)

sample_age = np.random.permutation(demo["age"])
sample_age.mean()

In [None]:
#sample dataframe

demo_sample = demo.sample(n=500)
demo_sample.describe()
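
<p>Random samples change on every run unless a seed is fixed. The sketch below, on a made-up dataframe, shows how random_state makes a sample reproducible and how frac and replace can be used to draw a sample with replacement.</p>

```python
import pandas as pd

# Hypothetical data standing in for the demo dataframe.
df = pd.DataFrame({"age": range(20, 70), "income": range(30, 80)})

# random_state fixes the seed so the same rows are drawn on every run.
sample_a = df.sample(n=10, random_state=42)
sample_b = df.sample(n=10, random_state=42)
print(sample_a.equals(sample_b))

# frac draws a proportion of rows; replace=True samples with replacement.
boot = df.sample(frac=1.0, replace=True, random_state=0)
print(boot["age"].mean(), df["age"].mean())
```

<p>Note that sample raises an error when n is larger than the number of rows unless replace=True is set.</p>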

In [None]:
#dummy variables

season_dummies = pd.get_dummies(bigfoot["season"], prefix="season")
season_dummies

In [None]:
#add dummies to dataframe

bigfoot_dummies = bigfoot.join(season_dummies)
bigfoot_dummies

In [None]:
#Sort

demo.sort_values(by="income")


demo.sort_values(by=['income', 'age'], ascending=False).head()

In [None]:
#Pivot Table

demo.pivot_table(
    values="age", index="inccat", columns="ed", aggfunc="mean"
)
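
<p>A pivot table is a grouped summary laid out as a grid. To make that concrete, the sketch below builds a small made-up dataframe with the same column names as the pivot table above and reproduces the result with groupby. The values are hypothetical.</p>

```python
import pandas as pd

# Hypothetical rows using the same column names as the pivot table above.
df = pd.DataFrame({
    "inccat": ["low", "low", "high", "high", "high"],
    "ed": ["college", "hs", "college", "college", "hs"],
    "age": [30, 40, 50, 60, 45],
})

table = df.pivot_table(values="age", index="inccat", columns="ed", aggfunc="mean")
print(table)

# The same result via groupby: group on both keys, then move "ed" to columns.
grouped = df.groupby(["inccat", "ed"])["age"].mean().unstack("ed")
print(grouped)
```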

In [None]:
#Write

demo.to_csv("demo_from_python.csv")
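
<p>By default to_csv also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below writes to an in-memory buffer rather than a file, which is a choice made to keep the example self-contained, and compares the two options on made-up data.</p>

```python
import io
import pandas as pd

df = pd.DataFrame({"age": [25, 40], "income": [31, 58]})

# By default to_csv also writes the row index as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
print(pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist())

# index=False writes only the data columns, so the data round-trips cleanly.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
round_trip = pd.read_csv(io.StringIO(without_index.getvalue()))
print(round_trip.equals(df))
```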