<h1>Reading and Cleaning Data</h1>
<p>In this section we discuss how to import files into Python and how to clean and modify the resulting data. When reading files, remember that the working directory matters: a relative file name such as <code>bigfoot.csv</code> is looked up in the directory the notebook is running from.</p>
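<p>One way to check which directory relative paths are resolved against is sketched below. It uses only the standard library; <code>bigfoot.csv</code> is the file name used later in this section, and the variable names are illustrative.</p>

```python
from pathlib import Path

# The directory that relative file paths are resolved against.
working_dir = Path.cwd()
print(working_dir)

# A relative name such as "bigfoot.csv" is looked up inside working_dir.
csv_path = working_dir / "bigfoot.csv"

# If this prints False, pd.read_csv("bigfoot.csv") would raise FileNotFoundError.
print(csv_path.exists())
```

<p>If the file is missing, either move it into the working directory or pass its full path to the read function.</p>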

In [1]:
#importing a library

import pandas as pd

#Alias ->
#"as" gives a library a shorter alias (here, pd). The alias simplifies your code and
#reduces the amount of typing you need to do.

<h3> Read Options</h3>
<ul><li>read_csv -- Load delimited data from a file, URL, or file-like object; use comma as default delimiter</li>
<li>read_fwf -- Read data in fixed-width column format (i.e., no delimiters)</li>
<li>read_excel -- Read tabular data from an Excel XLS or XLSX file</li>
<li>read_html -- Read all tables found in the given HTML document</li>
<li>read_json -- Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object</li>
<li>read_sas -- Read a SAS dataset stored in one of the SAS system’s custom storage formats</li>
<li>read_spss -- Read a data file created by SPSS</li>
<li>read_stata -- Read a dataset from Stata file format</li>
<li>read_xml -- Read a table of data from an XML file</li></ul>
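<p>As a minimal sketch of the "file-like object" option listed above, the example below reads small in-memory strings instead of files on disk. The data and variable names are made up for illustration.</p>

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data, held in memory so no file is needed.
csv_text = "id;state;count\n1;WV;12\n2;OH;7\n"

# read_csv accepts a file-like object; sep overrides the default comma delimiter.
sightings = pd.read_csv(io.StringIO(csv_text), sep=";")
print(sightings)

# read_json also accepts a file-like object.
json_text = '[{"id": 1, "state": "WV"}, {"id": 2, "state": "OH"}]'
states = pd.read_json(io.StringIO(json_text))
print(states)
```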

In [None]:
#pyreadstat is needed by pd.read_spss; %pip installs it into the notebook's own environment
%pip install pyreadstat

In [None]:
import pyreadstat

<h3>ICPSR Example</h3>
<ul><li><a href="https://databases.lib.wvu.edu/connect/1360614561">ICPSR - Inter-University Consortium for Political and Social Research</a></li></ul>

In [2]:
anes_2020 = pd.read_spss("38034-0001-Data.sav")
anes_2020

Unnamed: 0,VERSION,V200001,V160001_ORIG,V200002,V200003,V200004,V200005,V200010A,V200010B,V200010C,...,V202626,V202627,V202628,V202629,V202630,V202631,V202632,V202633,V202634,V202635
0,ANES2020TimeSeries_20210324,200015.0,401318.0,3. Web,2. ANES 2016-2020 Panel,-2. Data will be available as part of the full...,0. No special concern,0.827932,0.611133,2.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
1,ANES2020TimeSeries_20210324,200022.0,300261.0,3. Web,2. ANES 2016-2020 Panel,-2. Data will be available as part of the full...,0. No special concern,1.087641,1.209783,2.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
2,ANES2020TimeSeries_20210324,200039.0,400181.0,3. Web,2. ANES 2016-2020 Panel,-2. Data will be available as part of the full...,0. No special concern,0.671765,0.823936,1.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
3,ANES2020TimeSeries_20210324,200046.0,300171.0,3. Web,2. ANES 2016-2020 Panel,-2. Data will be available as part of the full...,0. No special concern,0.491910,0.512837,2.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
4,ANES2020TimeSeries_20210324,200053.0,405145.0,3. Web,2. ANES 2016-2020 Panel,-2. Data will be available as part of the full...,1. Some concern (possible substitution),1.189965,0.856575,1.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8275,ANES2020TimeSeries_20210324,535315.0,"-1. Inapplicable, not a re-interview case",1. Video,"6. 3C Fresh sample: video, web, or phone",-2. Data will be available as part of the full...,0. No special concern,0.996324,1.480103,1.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
8276,ANES2020TimeSeries_20210324,535360.0,"-1. Inapplicable, not a re-interview case",1. Video,"6. 3C Fresh sample: video, web, or phone",-2. Data will be available as part of the full...,0. No special concern,1.533625,1.503653,2.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
8277,ANES2020TimeSeries_20210324,535414.0,"-1. Inapplicable, not a re-interview case",2. Telephone,"6. 3C Fresh sample: video, web, or phone",-2. Data will be available as part of the full...,0. No special concern,2.043217,1.150732,1.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable
8278,ANES2020TimeSeries_20210324,535421.0,"-1. Inapplicable, not a re-interview case",3. Web,"6. 3C Fresh sample: video, web, or phone",-2. Data will be available as part of the full...,0. No special concern,0.366220,0.281583,2.0,...,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable,-1. Inapplicable


In [None]:
#Importing datasets into Python
#(%whos below lists the variables currently defined in the notebook)

bigfoot = pd.read_csv('bigfoot.csv')
%whos

In [None]:
#Viewing dataframes with .head() and .tail()

bigfoot.head() #shows the first 5 rows by default

#bigfoot.tail(10) #pass a number to change how many rows are displayed

In [None]:
#viewing the data type of each variable in the dataframe

bigfoot.dtypes

In [None]:
#viewing the number of observations (rows) and variables (columns)

bigfoot.shape

In [None]:
#viewing the names of variables

bigfoot.columns.values
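
<p>The inspection steps above can be tried on a small made-up dataframe. In this sketch, <code>toy</code> is a hypothetical stand-in for <code>bigfoot</code>; its columns are not taken from the real file.</p>

```python
import pandas as pd

# Hypothetical stand-in for the bigfoot data.
toy = pd.DataFrame({
    "state": ["WV", "OH", "PA"],
    "latitude": [38.6, 39.9, None],
    "year": [1999, 2004, 2011],
})

# shape is a (rows, columns) tuple: observations first, variables second.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# dtypes reports one data type per column; a numeric column with a
# missing value is stored as float64.
print(toy.dtypes)

# Column names as a plain Python list.
print(list(toy.columns))
```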

<h1>Cleaning</h1>

<h3>Why would you need to clean data?</h3>
<ul>
    <li>Data in columns and rows are not arranged correctly</li>
    <li>Missing data need to be filled in with values or dropped</li>
    <li>Units are wrong or inconsistent</li>
    <li>Order of magnitude is off</li>
    <li>Outliers skew the data</li>
    </ul>
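<p>Before deciding how to handle missing data, it can help to count how much is missing in each variable. The sketch below uses a made-up dataframe; <code>toy</code> and its columns are illustrative only.</p>

```python
import pandas as pd

# Hypothetical data with some missing values.
toy = pd.DataFrame({
    "state": ["WV", None, "PA", "OH"],
    "latitude": [38.6, 39.9, None, None],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna(subset=...) drops a row only when the listed column is missing,
# which loses less data than dropping every row that has any NA.
has_latitude = toy.dropna(subset=["latitude"])
print(has_latitude.shape)
```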

In [None]:
#remove every observation (row) that has at least one NA value

bigfoot_cleaned = bigfoot.dropna()
bigfoot_cleaned.head()

In [None]:
%whos

In [None]:
#replace NA values with a placeholder value (999)

bigfoot.fillna(999, inplace = True)
bigfoot.head()

In [None]:
#get value counts for a variable

bigfoot["latitude"].value_counts()

<h3>Boolean Operators</h3>
<p>Use comparison operators to filter the observations in a variable. Each comparison returns True or False for every observation.</p>
<ul>
    <li>Equal ( == )</li>
    <li>Not equal ( != )</li>
    <li>Greater than ( > )</li>
    <li>Less than ( < )</li>
    <li>Greater than or equal ( >= )</li>
    <li>Less than or equal ( <= )</li>
    </ul>
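<p>A comparison on a column produces a boolean Series, and several comparisons can be combined with <code>&amp;</code> (and) or <code>|</code> (or). The sketch below uses made-up data; the 999 placeholder mirrors the value used with <code>fillna</code> in this section.</p>

```python
import pandas as pd

# Hypothetical data; 999.0 stands in for a missing latitude.
toy = pd.DataFrame({
    "latitude": [38.6, 999.0, 45.2, 29.9],
    "year": [1999, 2004, 2011, 1987],
})

# A single comparison gives one True/False value per row.
mask = toy["latitude"] != 999
print(mask.tolist())

# Wrap each comparison in parentheses when combining them,
# because & and | bind more tightly than the comparison operators.
recent_valid = toy[(toy["latitude"] != 999) & (toy["year"] >= 2000)]
print(recent_valid)
```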

In [None]:
#filter observations using comparison operators
#keep only the rows whose latitude is not the 999 placeholder

bigfoot = bigfoot[bigfoot["latitude"] != 999]
bigfoot["latitude"].value_counts()

Can we find the date?

<h2>Manipulating</h2>
<p>Fixing data and manipulating variables.</p>

In [None]:
demo = pd.read_csv('demo.csv')
demo.columns

In [None]:
#Recoding

#let's use the value_counts() method to view a variable

demo["gender"].value_counts() #what if the values are not coded consistently?

In [None]:
#Changing the case of values

#convert to lower case with -- demo["gender"].str.lower()

#demo["gender"] = demo["gender"].str.lower()

#convert to title case (first letter capitalized, the rest lower case)
demo["gender"] = demo["gender"].str.title()

demo["gender"].value_counts()

In [None]:
#recode

#.loc[] selects the rows that match a condition and the column to change

#str.contains() finds the values that match a pattern we give so we can replace them
#na=False treats missing values as non-matches instead of raising an error

demo.loc[demo["gender"].str.contains("F", na=False), "gender"] = "Female"
demo.loc[demo["gender"].str.contains("M", na=False), "gender"] = "Male"
demo["gender"].value_counts()
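
<p>An alternative to <code>str.contains</code> is to normalize the text first and then recode with a dictionary. This is a sketch on made-up values, not the contents of <code>demo.csv</code>; it assumes every valid entry starts with "f" or "m" once whitespace and case are removed.</p>

```python
import pandas as pd

# Hypothetical, inconsistently coded values, including one missing value.
gender = pd.Series(["f", "Female", "M", "male", " F ", None])

# Strip whitespace, lower the case, and keep only the first letter.
first_letter = gender.str.strip().str.lower().str[0]

# map() looks each value up in the dictionary; values with no match,
# including missing values, stay missing.
recoded = first_letter.map({"f": "Female", "m": "Male"})
print(recoded.value_counts(dropna=False))
```

<p>Because unmatched values become missing rather than silently keeping their old spelling, unexpected codes are easy to spot afterwards.</p>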

In [None]:
#subset a single variable

gender = demo["gender"]
gender.head()

In [None]:
#subset multiple variables (note the double brackets)

gender_income = demo[["gender", "income"]]
gender_income

In [None]:
#filter values

above_35 = demo[demo["income"] > 35]
above_35.mean(numeric_only=True) #average only the numeric variables; text columns such as gender are skipped

In [None]:
#Sort

#sort by income, smallest first (the default)
demo.sort_values(by="income")

#sort by income, then age, largest first
demo.sort_values(by=['income', 'age'], ascending=False).head()

In [None]:
#Pivot Table

demo.pivot_table(
    values="age", index="inccat", columns="ed", aggfunc="mean"
)
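
<p>To see what the pivot table computes, the same call can be run on a few made-up rows. The category labels below are hypothetical; the actual codes in <code>demo.csv</code> may differ.</p>

```python
import pandas as pd

# Hypothetical rows using the same column names as the call above.
toy = pd.DataFrame({
    "inccat": ["low", "low", "high", "high"],
    "ed": ["college", "college", "college", "hs"],
    "age": [30, 40, 50, 60],
})

# One row per inccat value, one column per ed value,
# and the mean age of the matching observations in each cell.
table = toy.pivot_table(values="age", index="inccat", columns="ed", aggfunc="mean")
print(table)
```

<p>Combinations that never occur in the data (here, low income with "hs") appear as NaN in the table.</p>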

In [None]:
#Write the dataframe to a CSV file

demo.to_csv("demo_from_python.csv")
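
<p>By default <code>to_csv</code> also writes the row index as an unnamed first column, which shows up as <code>Unnamed: 0</code> when the file is read back. The sketch below shows the difference on made-up data; it writes to strings rather than files so nothing is created on disk.</p>

```python
import io

import pandas as pd

# Hypothetical data standing in for demo.
toy = pd.DataFrame({"gender": ["Female", "Male"], "income": [31, 55]})

# With no path given, to_csv returns the CSV text as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Reading the text back shows the extra column created by the index.
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with_index)
print(cols_without_index)
```

<p>Passing <code>index=False</code> is usually the better choice when the index is only the default row numbering.</p>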