# Intro to Data Wrangling  

Data comes from many sources, from creating your own to harvesting it from other locations.  Each end of the spectrum has its strengths and limitations.  The strength of creating your own is that you know EXACTLY how the data was collected, what the settings were, what the assumptions were, and when it was done.  The limitation is that collecting your own data is often very costly in time, money, or both.  You might not have the people or the tools to get the data, especially if it's scientific data.  On the other end, harvesting data from others' work is much easier, as they've already collected it.  What you don't know is what their assumptions were, when it was done, how it was done, or what the settings were, unless they are documented (rare).  What happens the majority of the time is that you find several sets of data that you want to use, all of them collected separately, with different techniques, assumptions, or data collection processes.  Now it's your job to 'merge' them together - that is *Data Wrangling*.  

Data wrangling requires many skills - you need to know the material, you need to be able to make decisions, and you need to be curious, among many others.  One of the technical tools we will use to analyze the data is Python.  Why use Python instead of anything else?  One of Python's strengths is that it handles incomplete data (empty data fields) well, where many other tools don't.

Depending on your desired end state, you might be responsible for collecting data from other sources, such as the web or databases.  That data was built or collected with a focus that probably doesn't target your problem.  To make the data usable for you and to preserve your data integrity, you'll have to take a deep dive into how the data was collected to make sure it doesn't assume away anything you might need.  Visualization will give you some clues about the data, but Python provides other tools that will help you with this.

In short, you will need to standardize your data.  If your standard is kilograms and the data you want to add is measured in ounces, you'll have to do the mathematical conversion before you consolidate it into your master dataset.  You'll need to determine which pieces of which datasets you're going to keep, clean them up individually, standardize them, and then merge them - THEN analyze the entire lot.  Studies suggest that data wrangling takes 60+% of your time.   
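As a rough illustration of the standardize-then-merge order described above, here is a minimal sketch. The two small tables, their column names (`item`, `weight_kg`, `weight_oz`), and the values are made up for the example; only the conversion factor is a fixed fact.

```python
import pandas as pd

OUNCES_PER_KILOGRAM = 35.27396195  # avoirdupois ounces in one kilogram

# Two hypothetical sources that measured the same thing in different units.
master = pd.DataFrame({"item": ["a", "b"], "weight_kg": [1.0, 2.5]})
incoming = pd.DataFrame({"item": ["c", "d"], "weight_oz": [16.0, 35.27396195]})

# Standardize first: convert ounces to kilograms and drop the old column.
incoming["weight_kg"] = incoming["weight_oz"] / OUNCES_PER_KILOGRAM
incoming = incoming.drop(columns="weight_oz")

# Only then consolidate into the master dataset.
combined = pd.concat([master, incoming], ignore_index=True)
print(combined)
```

Converting before combining keeps a single unit per column, so later summaries are not silently mixing ounces and kilograms.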

Before that, let's characterize data in yet another way; this perspective will help with programming.

## Numerical data
__Numerical__ data, also known as quantitative variables, are measured in numbers that indicate their amount and units that indicate the quantity. Remember measurement in physics, with its fundamental and derived quantities. Examples include time, temperature, currency, weight, age, years of experience, salary, and price. They are all described by numbers and units. Units must be uniform before numerical data can be compared.

__Categorical__ data, also referred to as qualitative variables, refers to attributes or classes like colours, states, sizes, car models, countries, animal species, chemical compounds, and classes of foods.

### Statistical levels of measurement ##

__Nominal__:  Most categorical data fall into the nominal level.  The categories are just names and classes with no progressive relationship between them.  Examples are names of cities or colors.  They are disjoint and every observation falls into one category.  Categorical data (nominal data) can be assigned numerical values and encoded as numbers for classification algorithms.  These numeric values are arbitrary and arithmetic operations or comparisons including ordering cannot be performed on the encoding numbers.  Only frequency counts can be performed.

__Ordinal__:  Ordinal levels are categorical, but ordering can be performed on them. So when they are encoded using numbers, progressive or consecutive numbers must be used because the order matters. A common example is user feedback: very bad, bad, indifferent, good, very good, which is usually encoded as 1 = very bad, 2 = bad, 3 = indifferent, 4 = good, 5 = very good. Another good example is educational level: less than high school, high school graduate, Bachelor's degree, Master's degree, Doctorate degree. This measurement is progressive. However, we cannot tell anything about the size of the difference between categories, so arithmetic operations cannot be performed since the scale difference cannot be interpreted.

__Interval__:  This level is for numerical data only. It has regular, constant intervals like a scale or number line. This scale has a zero point, but this zero point does not represent the absence of that value. The values here are relative and taken with reference to a point, which is usually the zero point. Examples are temperature measured in Celsius, height measured with reference to sea level, and dates and years measured as BC and AD. For this measurement level, dividing or multiplying values by each other is meaningless; only the differences between them are relevant. Mean, standard deviation, range and some other statistical analyses can be performed. Values can also be sorted.

__Ratio__:  Numerical values with an absolute zero, that is, where a zero score represents the total absence of that quantity, are classified as the ratio level of measurement. These values have the same characteristics as real numbers. Examples include price, age, salary, and the height and weight of an object.
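The four levels above can be sketched in pandas and plain arithmetic. This is only an illustration under assumed data: the colour and feedback values are invented, and using `pd.Categorical` with `ordered=True` is one reasonable way to encode ordinal data, not the only one.

```python
import pandas as pd

# Nominal: categories are only labels, so counting is the meaningful operation.
colors = pd.Series(["red", "blue", "red", "green"], dtype="category")
print(colors.value_counts())

# Ordinal: declare the order explicitly so comparisons and sorting respect it.
levels = ["very bad", "bad", "indifferent", "good", "very good"]
feedback = pd.Series(
    pd.Categorical(["good", "very bad", "indifferent"], categories=levels, ordered=True)
)
codes = feedback.cat.codes + 1   # 1 = very bad ... 5 = very good
print(codes.tolist())
print(feedback.min(), feedback.max())

# Interval vs. ratio: 20 °C is not "twice as hot" as 10 °C, because 0 °C is
# not an absence of temperature. Converting to kelvin (a ratio scale) shows it.
ratio_celsius = 20 / 10
ratio_kelvin = (20 + 273.15) / (10 + 273.15)
print(ratio_celsius, round(ratio_kelvin, 3))
```

The ordinal codes preserve order only; the gap between 1 and 2 is not claimed to equal the gap between 4 and 5.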

***

Computers organize data into types, and work their functions based on the specific data type.  In Python pandas, the data types are known as _dtypes_.  They include:

__float__ (float64) which represents decimal numbers such as 1.0

__int__ (int64) which represents integers such as 1

__datetime__ (datetime64[ns]) is used to represent dates, times and time differences

__string__ (object) which includes alphabetic strings and columns that mix multiple data types.
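To see these dtypes on a concrete table, here is a small sketch. The DataFrame and its column names are invented for illustration; the exact labels printed for the date and text columns can differ between pandas versions, so treat the comments as typical rather than guaranteed.

```python
import pandas as pd

# A tiny made-up table with one column of each kind described above.
sample = pd.DataFrame({
    "price": [1.0, 2.5],                                      # float
    "rooms": [1, 2],                                          # integer
    "sold_on": pd.to_datetime(["2011-08-20", "2011-08-23"]),  # datetime
    "district": ["north", "south"],                           # text (typically object)
})

# One dtype per column.
print(sample.dtypes)
```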

***

In Modules 2 and 3, we'll look at how to modify the data to standardize it, and how to address empty, spurious or duplicate data.

Let's explore a data set with lots of obvious problems.

First, let's talk about the new data set: the Russian housing dataset, from the website Kaggle.  It is real estate data from one of Russia's premier banks, and it was used in a Data Science competition to predict the price of specific apartments based on lots of parameters.  It has 2 sets of data, a training set and a test set, both of which are posted on BB.  The data was broken into 2 pieces, one to train a model and a separate one to test the model.  This technique will be introduced in Module 4, but until then, we can use the data.

This is a decent-sized data set: 30471 rows and 292 columns.  I also posted data_dictionary.txt, which describes what each column means.

Now let's look at a sample of the content.

__Example 1:__


In [4]:
import pandas as pd 
data = pd.read_csv("train.csv") 
display(data.head(10))    # view the first 10 rows of data

Unnamed: 0,id,timestamp,full_sq,life_sq,floor,max_floor,material,build_year,num_room,kitch_sq,...,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000,price_doc
0,1,2011-08-20,43,27.0,4.0,,,,,,...,9,4,0,13,22,1,0,52,4,5850000
1,2,2011-08-23,34,19.0,3.0,,,,,,...,15,3,0,15,29,1,10,66,14,6000000
2,3,2011-08-27,43,29.0,2.0,,,,,,...,10,3,0,11,27,0,4,67,10,5700000
3,4,2011-09-01,89,50.0,9.0,,,,,,...,11,2,1,4,4,0,0,26,3,13100000
4,5,2011-09-05,77,77.0,4.0,,,,,,...,319,108,17,135,236,2,91,195,14,16331452
5,6,2011-09-06,67,46.0,14.0,,,,,,...,62,14,1,53,78,1,20,113,17,9100000
6,7,2011-09-08,25,14.0,10.0,,,,,,...,81,16,3,38,80,1,27,127,8,5500000
7,8,2011-09-09,44,44.0,5.0,,,,,,...,9,4,0,11,18,1,0,47,4,2000000
8,9,2011-09-10,42,27.0,5.0,,,,,,...,19,8,1,18,34,1,3,85,11,5300000
9,10,2011-09-13,36,21.0,9.0,,,,,,...,19,13,0,10,20,1,3,67,1,2000000


The variable.head(10) call shows the first 10 rows of data; the number in the parentheses is the number of rows shown.  Obviously, we can't see all 292 columns.  The variable.info() command reports the data types and, for each column it lists, the number of non-null values.  What interests us are the values that are missing, which pandas displays as *NaN* (Not a Number).

__Example 2:__

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Columns: 292 entries, id to price_doc
dtypes: float64(119), int64(157), object(16)
memory usage: 67.9+ MB


From looking at this, the data set is too large (30471 rows x 292 columns) for info() to list a summary of every column, so we'll need to look at the columns we think are interesting.  To inspect a single column, we use the format data[['columnname']].info(), where columnname is the column label.  In this example, we're looking at the total square meters of the space.

__Example 3:__

In [10]:
data[['full_sq']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   full_sq  30471 non-null  int64
dtypes: int64(1)
memory usage: 238.2 KB


From this, we can see that there are 30471 non-null (that is, non-empty) values.  Since there are 30471 rows, every one has a value in it.  Let's look at a few others.

Let's say we had a client for whom the height of the building (max_floor, the number of floors in the building) was important, as well as the building year and kitchen size. 

__Example 4:__

In [12]:
data[['max_floor','build_year','kitch_sq']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30471 entries, 0 to 30470
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   max_floor   20899 non-null  float64
 1   build_year  16866 non-null  float64
 2   kitch_sq    20899 non-null  float64
dtypes: float64(3)
memory usage: 714.3 KB


From this, remembering that there are 30471 rows, we can see that roughly 9,600 rows (about 31%) are missing a value for max_floor and for kitchen size (kitch_sq), and only about 55% of rows report the year the building was built (build_year).
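Rather than subtracting the non-null counts from the row total by hand, the missing values can be counted directly. The sketch below uses a hypothetical four-row stand-in (`demo`) for the three columns above so that it runs on its own; the same two calls should work on the real `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the three columns above.
demo = pd.DataFrame({
    "max_floor": [9.0, np.nan, 12.0, 17.0],
    "build_year": [1979.0, np.nan, np.nan, 2005.0],
    "kitch_sq": [6.0, 9.0, np.nan, 1.0],
})

missing = demo.isna().sum()              # missing count per column
pct_missing = demo.isna().mean() * 100   # share of rows missing, in percent
summary = pd.DataFrame({"missing": missing, "pct_missing": pct_missing.round(1)})
print(summary)
```

On the real data this would be `data[['max_floor','build_year','kitch_sq']].isna().sum()`.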

Using the data.duplicated() command will generate a True/False value for each row, indicating whether it duplicates an earlier row.  

__Example 5:__

In [14]:
data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
30466    False
30467    False
30468    False
30469    False
30470    False
Length: 30471, dtype: bool

Every value displayed is False, but the output is truncated (note the ... in the middle), so scanning it by eye cannot confirm that all 30471 rows are free of duplicates.

Using data.duplicated().describe() shows us the number of unique values and the frequency of the most common one.  One unique value (False) means there is no duplicate data.  Two unique values (True and False) mean there IS at least one duplicate; assuming False is the top value, subtracting its frequency from the total count gives the number of duplicate rows.

__Example 6:__

In [15]:
data.duplicated().describe()


count     30471
unique        1
top       False
freq      30471
dtype: object

Unfortunately, with this large dataset, it doesn't help us much to do it across the entire set; we need to pick and choose our parameters.
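One way to pick and choose parameters is the `subset` argument of `duplicated()`. The sketch below is a made-up three-row example (the values echo the first rows shown earlier, but row 3 is invented) showing why a whole-row check can miss repeats when a unique `id` column is present.

```python
import pandas as pd

# Hypothetical rows: ids differ, but rows 0 and 2 describe the same sale.
sales = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": ["2011-08-20", "2011-08-23", "2011-08-20"],
    "full_sq": [43, 34, 43],
    "price_doc": [5850000, 6000000, 5850000],
})

whole_row_dupes = sales.duplicated().sum()   # compares every column, id included
key_cols = ["timestamp", "full_sq", "price_doc"]
subset_dupes = sales.duplicated(subset=key_cols).sum()
print(whole_row_dupes, subset_dupes)

# keep=False flags every member of a duplicate group, which helps inspection.
print(sales[sales.duplicated(subset=key_cols, keep=False)])
```

Which columns belong in `key_cols` is a judgment call that depends on what should count as "the same record" for your question.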

Using data.describe(), we can look at the 'big picture' values on each of the parameters (columns).  From these, we can (probably) determine whether we have data entry errors and/or outliers.

__Example 7:__

In [9]:
data.describe()


Unnamed: 0,id,full_sq,life_sq,floor,max_floor,material,build_year,num_room,kitch_sq,state,...,cafe_count_5000_price_2500,cafe_count_5000_price_4000,cafe_count_5000_price_high,big_church_count_5000,church_count_5000,mosque_count_5000,leisure_count_5000,sport_count_5000,market_count_5000,price_doc
count,30471.0,30471.0,24088.0,30304.0,20899.0,20899.0,16866.0,20899.0,20899.0,16912.0,...,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0,30471.0
mean,15237.917397,54.214269,34.403271,7.670803,12.558974,1.827121,3068.057,1.909804,6.399301,2.107025,...,32.058318,10.78386,1.771783,15.045552,30.251518,0.442421,8.648814,52.796593,5.98707,7123035.0
std,8796.501536,38.031487,52.285733,5.319989,6.75655,1.481154,154387.8,0.851805,28.265979,0.880148,...,73.465611,28.385679,5.418807,29.118668,47.347938,0.609269,20.580741,46.29266,4.889219,4780111.0
min,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100000.0
25%,7620.5,38.0,20.0,3.0,9.0,1.0,1967.0,1.0,1.0,1.0,...,2.0,1.0,0.0,2.0,9.0,0.0,0.0,11.0,1.0,4740002.0
50%,15238.0,49.0,30.0,6.5,12.0,1.0,1979.0,2.0,6.0,2.0,...,8.0,2.0,0.0,7.0,16.0,0.0,2.0,48.0,5.0,6274411.0
75%,22855.5,63.0,43.0,11.0,17.0,2.0,2005.0,2.0,9.0,3.0,...,21.0,5.0,1.0,12.0,28.0,1.0,7.0,76.0,10.0,8300000.0
max,30473.0,5326.0,7478.0,77.0,117.0,6.0,20052010.0,19.0,2014.0,33.0,...,377.0,147.0,30.0,151.0,250.0,2.0,106.0,218.0,21.0,111111100.0


***
__Question 1:__  Let's look at the range of sizes of these apartments.  What is the size of the largest?  Let's assume all the maximums belong to a single apartment.  What is the size of the median apartment?  And the smallest?  These are given in square meters, so you'll have to convert to square feet.  What are the sizes of the 25th and 75th percentile?

***
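For the unit conversion in Question 1, a small helper may be handy. This is a sketch: the function name `sqm_to_sqft` is made up, and the 100 m² input is an arbitrary example rather than a value from the data set.

```python
SQFT_PER_SQM = (1 / 0.3048) ** 2   # one foot is exactly 0.3048 m

def sqm_to_sqft(area_sqm: float) -> float:
    """Convert an area from square meters to square feet."""
    return area_sqm * SQFT_PER_SQM

print(round(sqm_to_sqft(100), 1))   # a made-up 100 m² apartment
```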

Let's look at the values of _max_floor_.  The mean is roughly 12.5 floors, which common sense says is not an unusual size.  But look at the max - 117 floors?  What is the highest number of floors in buildings in the largest cities?  Anything above 100 floors is very distinctive and well known - look up Russian buildings, and do you see many that tall?  I would check the data values for anything over about 80 floors.
Again, you can use the parameter names to pull out what you think is important.  

In [1]:
data[['max_floor','build_year','kitch_sq']].describe()


In looking at these, I would definitely check the kitchen size: are 25% of the kitchens really 1 square meter or less?  Maybe they are all in efficiency apartments, but that would be worth a look.  
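The two checks suggested above (buildings over about 80 floors, kitchens of 1 square meter or less) can be sketched with boolean filters. The `flats` table here is a hypothetical stand-in so the example runs on its own, and the threshold of 80 is just the rule of thumb from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real data.
flats = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "max_floor": [12.0, 117.0, np.nan, 9.0],
    "kitch_sq": [6.0, 1.0, 0.0, np.nan],
})

FLOOR_LIMIT = 80    # assumed threshold from the text above
tall = flats[flats["max_floor"] > FLOOR_LIMIT]     # NaN compares as False
tiny_kitchen = flats[flats["kitch_sq"] <= 1]
print(tall)
print(tiny_kitchen)
```

Rows with missing values drop out of both filters, so they need to be counted separately.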

What to do with these empty values and duplicates will be addressed in a future lesson.

__Exercise 1:__  Many Russians walk, and so a 3km zone was delineated for this data set.  *church_count_3000* is the number of churches within 3000m of each of these apartments.  Determine how many values are reported, and generate a data analysis of these values.

__Exercise 2:__ Cheap eats might be important.  *cafe_count_5000_na_price* is the number of cafes within a 5km zone that doesn't have a minimum bill.  Do the numbers on these.

__Question 2:__ Many of these values are missing.  What can you do about the missing numbers?  We'll cover the possibilities later, but what do you think we can do about them?

