## Chapter 16: Data, Data Everywhere<br><small>November 9, 2016</small>

### Parametric Data Types <small>(From Chapter 8)</small>

In [1]:
type Polynomial{T}
    coeffs::Vector{T}
end

In [2]:
poly1=Polynomial([1,2,3])

Polynomial{Int64}([1,2,3])

In [3]:
poly2=Polynomial([2.0,-3.0,0.0])

Polynomial{Float64}([2.0,-3.0,0.0])

In [4]:
poly3=Polynomial(["c","b","a"])

Polynomial{String}(String["c","b","a"])

In [5]:
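function printPolygon(p::Polynomial)
    # Build a string such as "1x^0+2x^1+3x^2"; coeffs[i] multiplies x^(i-1).
    str = ""
    for i = 1:length(p.coeffs)
        # Append a "+" separator after every term except the last.
        str = string(str, p.coeffs[i], "x^", i-1, i < length(p.coeffs) ? "+" : "")
    end
    str
end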

printPolygon (generic function with 1 method)

In [6]:
printPolygon(poly1)

"1x^0+2x^1+3x^2"

In [7]:
printPolygon(poly2)

"2.0x^0+-3.0x^1+0.0x^2"

In [8]:
printPolygon(poly3)

"cx^0+bx^1+ax^2"

In [9]:
import Base.+

In [10]:
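# Defined only for numeric coefficient types. The coefficient vectors are added
# element-wise, so both polynomials must have the same number of coefficients.
function +{T1<:Number,T2<:Number}(p1::Polynomial{T1}, p2::Polynomial{T2})
    Polynomial(p1.coeffs + p2.coeffs)
end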

+ (generic function with 164 methods)

In [11]:
methods(+);

In [12]:
poly1+poly2

Polynomial{Float64}([3.0,-1.0,3.0])

In [13]:
printPolygon(poly1+poly2)

"3.0x^0+-1.0x^1+3.0x^2"

In [14]:
poly1+poly3

LoadError: LoadError: MethodError: no method matching +(::Polynomial{Int64}, ::Polynomial{String})
Closest candidates are:
  +(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:138
  +{T1<:Number,T2<:Number}(::Polynomial{T1<:Number}, !Matched::Polynomial{T2<:Number}) at In[10]:2
while loading In[14], in expression starting on line 1
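
The error above follows from the `T1<:Number, T2<:Number` constraint on the `+` method: `Polynomial{String}` cannot match because `String` is not a subtype of `Number`. The short check below is only an illustrative sketch (it is not part of the original notebook); it also shows why adding an `Int64` polynomial to a `Float64` one produced `Float64` coefficients.

```julia
# Subtype checks that decide whether the constrained `+` method applies.
int_ok    = Int64 <: Number      # integer coefficients qualify
float_ok  = Float64 <: Number    # floating-point coefficients qualify
string_ok = String <: Number     # strings do not, hence the MethodError

# Mixed Int64/Float64 arithmetic promotes to the wider type.
promoted = promote_type(Int64, Float64)

println(int_ok, " ", float_ok, " ", string_ok, " ", promoted)
```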

### Data and Data Frames <small>Back to this chapter</small>
<hr>
**DataArray**  
- Single Type
- Handle Missing Data

In [15]:
using DataFrames

In [16]:
NA

NA

In [17]:
NA+1

NA

In [18]:
2*NA

NA

In [19]:
NA/NA

NA

In [20]:
z=NA

NA

In [21]:
typeof(z)

DataArrays.NAtype

Building a DataArray

In [22]:
data = @data([1,2,3,4])

4-element DataArrays.DataArray{Int64,1}:
 1
 2
 3
 4

In [23]:
mean(data)

2.5

In [24]:
std(data)

1.2909944487358056

In [25]:
data1 = @data([1,2,3,4,NA])

5-element DataArrays.DataArray{Int64,1}:
 1  
 2  
 3  
 4  
  NA

In [26]:
mean(data1)

NA

In [27]:
dropna(data1)

4-element Array{Int64,1}:
 1
 2
 3
 4

In [28]:
sum(dropna(data1))

10
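
Since `NA` propagates through arithmetic, `mean(data1)` is `NA`, but the same `dropna` step used for the sum gives a usable mean. The sketch below assumes the same DataFrames/DataArrays release as the cells above (where `NA`, `@data`, `dropna`, and `isna` are available); the variable names `clean`, `avg`, and `n_missing` are introduced here purely for illustration.

```julia
using DataFrames

data1 = @data([1, 2, 3, 4, NA])

# Drop the missing entry first, then compute statistics on what remains.
clean = dropna(data1)
avg = mean(clean)

# isna marks the missing positions; summing the flags counts them.
n_missing = sum(isna(data1))

println(avg, " ", n_missing)
```

The mean of the remaining values matches `mean(data)` from the earlier cell, since `data` holds the same four numbers.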

In [29]:
df=DataFrame(name=["A","B","C","D","E","F"],age=[21,27,13,41,33,48])

| | name | age |
|---|---|---|
| 1 | A | 21 |
| 2 | B | 27 |
| 3 | C | 13 |
| 4 | D | 41 |
| 5 | E | 33 |
| 6 | F | 48 |


In [30]:
mean(df[:age]) # get mean of ages in dataframe above

30.5

In [31]:
mean(df[:,2]) # same thing if you know what column number contains ages.

30.5

In [32]:
df[:age] # output ages

6-element DataArrays.DataArray{Int64,1}:
 21
 27
 13
 41
 33
 48

In [33]:
typeof(df[:age])

DataArrays.DataArray{Int64,1}

In [34]:
typeof(df[:name])

DataArrays.DataArray{String,1}

Given the political events unfolding right now, let's look at some census data...

In [41]:
census_data=readtable("Gaz_ua_national.txt",separator='\t');

In [42]:
head(census_data)

| | GEOID | NAME | UATYPE | POP10 | HU10 | ALAND | AWATER | ALAND_SQMI | AWATER_SQMI | INTPTLAT | INTPTLONG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 37 | Abbeville, LA Urban Cluster | C | 19824 | 8460 | 29222871 | 300497 | 11.283 | 0.116 | 29.967602 | -92.098219 |
| 2 | 64 | Abbeville, SC Urban Cluster | C | 5243 | 2578 | 11315197 | 19786 | 4.369 | 0.008 | 34.179237 | -82.379726 |
| 3 | 91 | Abbotsford, WI Urban Cluster | C | 3966 | 1616 | 5363441 | 13221 | 2.071 | 0.005 | 44.948612 | -90.315875 |
| 4 | 118 | Aberdeen, MS Urban Cluster | C | 4666 | 2050 | 7416616 | 52732 | 2.864 | 0.02 | 33.824742 | -88.554591 |
| 5 | 145 | Aberdeen, SD Urban Cluster | C | 25977 | 12114 | 33002447 | 247597 | 12.742 | 0.096 | 45.463186 | -98.471033 |
| 6 | 172 | Aberdeen, WA Urban Cluster | C | 29856 | 13139 | 39997951 | 1929689 | 15.443 | 0.745 | 46.976365 | -123.796056 |


In [43]:
typeof(census_data[:GEOID])

DataArrays.DataArray{Int64,1}

**Here are a few questions we may want to answer:**

1. What are the top 10 areas in population?
1. What does a histogram of the populations look like? (What are good bin sizes?)
1. What is the total population of all areas?
1. What are the top 10 areas in housing units?
1. What is the total number of housing units?
1. What is the average number of people per housing unit over all areas?
1. For the top 10 areas in population, what is the average number of people per housing unit?
1. What are the top 10 areas in land size?
1. What are the top 10 areas in water size?
1. What are the Massachusetts areas in the data?
1. What are the mean, median, and standard deviation of the population of the areas?
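
A few of these questions can be sketched on a small stand-in table. The example below is a minimal sketch, not the notebook's own solution: `toy` is a hypothetical DataFrame holding only the six rows printed by `head(census_data)`, and it assumes the same DataFrames release as the cells above, where `sort` takes a `cols` keyword, `head(df, n)` returns the first `n` rows, and `df[:col]` returns a column.

```julia
using DataFrames

# Hypothetical stand-in: the six rows shown by head(census_data), three columns only.
toy = DataFrame(
    NAME  = ["Abbeville, LA Urban Cluster", "Abbeville, SC Urban Cluster",
             "Abbotsford, WI Urban Cluster", "Aberdeen, MS Urban Cluster",
             "Aberdeen, SD Urban Cluster", "Aberdeen, WA Urban Cluster"],
    POP10 = [19824, 5243, 3966, 4666, 25977, 29856],
    HU10  = [8460, 2578, 1616, 2050, 12114, 13139])

# Top areas by population: sort in descending order and keep the first rows.
top3 = head(sort(toy, cols = [:POP10], rev = true), 3)

# Totals, and people per housing unit over all rows.
total_pop   = sum(toy[:POP10])
total_units = sum(toy[:HU10])
people_per_unit = total_pop / total_units

# Mean and standard deviation of the population column.
pop_mean = mean(toy[:POP10])
pop_std  = std(toy[:POP10])

println(top3[:NAME])
println(total_pop, " ", total_units, " ", people_per_unit)
```

On the full table, the same calls would be applied to `census_data` with `10` in place of `3`; any rows containing `NA` would need `dropna` first.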

Let's read the table again with explicit, more reasonable column types:

In [37]:
census_data=readtable("Gaz_ua_national.txt",separator='\t',eltypes=[String,String,String,Int64,Int64,Int64,Int64,Float64,Float64,Float64,Float64]);

In [38]:
head(census_data)

| | GEOID | NAME | UATYPE | POP10 | HU10 | ALAND | AWATER | ALAND_SQMI | AWATER_SQMI | INTPTLAT | INTPTLONG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 37 | Abbeville, LA Urban Cluster | C | 19824 | 8460 | 29222871 | 300497 | 11.283 | 0.116 | 29.967602 | -92.098219 |
| 2 | 64 | Abbeville, SC Urban Cluster | C | 5243 | 2578 | 11315197 | 19786 | 4.369 | 0.008 | 34.179237 | -82.379726 |
| 3 | 91 | Abbotsford, WI Urban Cluster | C | 3966 | 1616 | 5363441 | 13221 | 2.071 | 0.005 | 44.948612 | -90.315875 |
| 4 | 118 | Aberdeen, MS Urban Cluster | C | 4666 | 2050 | 7416616 | 52732 | 2.864 | 0.02 | 33.824742 | -88.554591 |
| 5 | 145 | Aberdeen, SD Urban Cluster | C | 25977 | 12114 | 33002447 | 247597 | 12.742 | 0.096 | 45.463186 | -98.471033 |
| 6 | 172 | Aberdeen, WA Urban Cluster | C | 29856 | 13139 | 39997951 | 1929689 | 15.443 | 0.745 | 46.976365 | -123.796056 |


In [40]:
typeof(census_data[:GEOID])

DataArrays.DataArray{String,1}

Much better: `GEOID` is an identifier, not a quantity, so a string column suits it.

In [44]:
sum(census_data[:POP10])

252746527