<a href="https://colab.research.google.com/github/johnsl01/income/blob/master/incomereg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# _Multiple variable (hyperplane) regression_
Income data 


## _ML outline approach_

A structured ML student exercise typically comprises a set of data with one or more answers per sample, and a second set of data with the answers missing.

The basic task is to use machine learning to provide the missing answers.

The normal approach is:

Explore and understand the data provided, typically by producing some basic metrics for the data and some simple correlations between elements of the data and the answers.

Assess the quality of the data. This is normally carried out as part of the exploration, but concentrates on missing and bad data and any extreme outliers. It also covers any unstructured data (free-text strings etc.) and determines what remediation is necessary: impute the missing data, discard samples, or manipulate the unstructured data to generate structured elements from it.

Carry out some initial ML tasks such as linear (planar) regression, assess the quality of the results on the known answers, and produce an initial version of the answers to use as a benchmark for more sophisticated next steps.

Make an assessment, based on the information gathered above, of the approach to ML: which types of ML tools are appropriate to the data, and what level of quality could be achieved. For a student / training exercise the requirement may be to consider more than one approach and to compare the results, or it may be simply to produce the best result. In the latter case there may be a submission process, so that the 'real' answers are hidden and only a score is available to compare results.

Typically the answers provided to ML problems fall into two categories: continuous numerical, in which case the problem is generally a regression problem, or categorical (ordered or not), in which case the problem is generally a classification problem. A task may also contain one or more answers of each type.

Student exercises can also contain more specific requirements: to use a specific type of ML approach, to demonstrate the impact of some parameter changes, or to examine the impact of underfitting or overfitting the training data.


## _Description and general notes_

$$y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \dots$$
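
As a minimal sketch of the model above, the coefficients B0, B1, B2, B3 can be estimated by ordinary least squares. The data here is synthetic and noise-free, and the names `X`, `y`, `true_coef` and `coef` are illustrative assumptions rather than part of this notebook.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 1 + 2*X1 - 3*X2 + 0.5*X3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([1.0, 2.0, -3.0, 0.5])  # B0, B1, B2, B3
y = true_coef[0] + X @ true_coef[1:]

# Prepend a column of ones so that B0 is estimated as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ coef - y||^2
coef, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)
```

With real data the fit will not be exact, and the size of the residuals gives a first indication of how well a hyperplane describes the target.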

## Imports and data loads

In [0]:
###############################################
#@title                 Imports               #
###############################################
print ("Imports")

import numpy as np

import pandas as pd

import sklearn

# need this as we have two categoricals with a very large number of 
# categories (possibly with very low variance among them)
# one-hot - or other multi-feature encoding - would hugely increase
# the dimensionality.
!pip install category_encoders
import category_encoders as ce

# !pip install fancyimpute
# import fancyimpute as fi
# fancyimpute no longer includes MICE - 
# but the sklearn implementation is experimental

import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

from google.colab import files

from datetime import datetime
print (datetime.now())



In [0]:
###############################################
#@title               Data loads              #
###############################################
print ("Data loads")
sourcedata = "https://raw.githubusercontent.com/johnsl01/income/master/incomeknown.csv"
incomeknown_df = pd.read_csv(sourcedata, sep=",")

sourcedata = "https://raw.githubusercontent.com/johnsl01/income/master/incomeunknown.csv"
incomeunknown_df = pd.read_csv(sourcedata, sep=",")

print (datetime.now())

In [0]:
# look at the data 
incomeknown_df.head(10)

In [0]:
# and the unknowns:
incomeunknown_df.head(10)

# looks similar - as to be hoped !

In [0]:
# don't like the column names - much easier with simple strings with no spaces.
incomeknown_df.rename(columns = {"Year of Record" : "Year",
                                 "Size of City" : "CitySize",
                                 "University Degree" : "Degree",
                                 "Wears Glasses" : "Glasses",
                                 "Hair Color" : "Hair",
                                 "Body Height [cm]" : "Height",
                                 "Income in EUR" : "Income"},
                      inplace = True)

incomeunknown_df.rename(columns = {"Year of Record" : "Year",
                                   "Size of City" : "CitySize",
                                   "University Degree" : "Degree",
                                   "Wears Glasses" : "Glasses",
                                   "Hair Color" : "Hair",
                                   "Body Height [cm]" : "Height",
                                   "Income in EUR" : "Income"},
                        inplace = True)

print (datetime.now())

In [0]:
# look again 
incomeknown_df.tail(10)

In [0]:
# and unknowns
incomeunknown_df.tail(10)

## _Initial data exploration_

In [0]:
incomeknown_df.describe()

# data is sort of crappy - normally if data doesn't make sense you question sources
# outliers - negative income - 14 age 115 age -  5M income 
# glasses almost exactly 50% ! - but does it correlate?  check later

# missing data in Year -- choose mean ?

# missing data in age -- impute with mice ?
# and do year the same way as well?

# need to check the missings in unknown - would be a bit mean to 
# have missings in other cols there
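
One way to compare the missing data in the two sets is to count the missing values per column side by side. This is only a sketch on toy frames: `known` and `unknown` stand in for `incomeknown_df` and `incomeunknown_df`, and their contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for incomeknown_df / incomeunknown_df (assumed data)
known = pd.DataFrame({"Year": [1990.0, np.nan, 2005.0, 2010.0],
                      "Age": [25.0, 40.0, np.nan, np.nan],
                      "Glasses": [0, 1, 1, 0]})
unknown = pd.DataFrame({"Year": [np.nan, 2001.0],
                        "Age": [33.0, 51.0],
                        "Glasses": [1, 0]})

# Count missing values per column, side by side for both sets
missing = pd.DataFrame({"known": known.isna().sum(),
                        "unknown": unknown.isna().sum()})

# Share of rows affected, to compare the proportions between the sets
missing["known_pct"] = 100 * missing["known"] / len(known)
missing["unknown_pct"] = 100 * missing["unknown"] / len(unknown)
print(missing)
```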

In [0]:
incomeunknown_df.describe()

# years and age also missing but in slightly different proportions
# others still all present - and comparable ranges - 

# note age outlier - older than any recorded human - 
# cannot treat data as real - just an academic exercise - or noise added very poorly.

# ok the numeric data is workable - still no view on whether it is good enough to give a result

# replacing missing data should be carried out across the entire data set 
# and not include the known income - this is just circular and makes the test/train more different! 
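
The idea in the last comment, filling missing values from the whole data set without touching the income, might look like the sketch below. The toy frames and the choice of the mean as fill value are assumptions for illustration; the point is only that the fill value is computed from the pooled feature column and the target plays no part.

```python
import numpy as np
import pandas as pd

# Toy frames (assumed data); only the knowns carry the target
known = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Income": [10.0, 20.0, 30.0]})
unknown = pd.DataFrame({"Age": [60.0, np.nan]})

# Pool the feature column from both sets; the target (Income) is not used
pooled_age = pd.concat([known["Age"], unknown["Age"]], ignore_index=True)
fill_value = pooled_age.mean()  # NaN values are skipped by default

# Apply the same fill value to both sets
known["Age"] = known["Age"].fillna(fill_value)
unknown["Age"] = unknown["Age"].fillna(fill_value)
print(fill_value)
```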

In [0]:
# and the categorical data :
incomeknown_df[['Gender','Country','Profession','Degree','Hair']].describe()

# a lot of missing data in Gender - and 5 values present - confirm the '3' are equal 
# - so check for 'rainbow' correlations
# treat missing as a 4th and check its correlation as well 

# Country - all present - note most frequent is a 'small' country 
# very wide range of categories here - treatment to be considered.
# but if it does contribute then how we encode it matters
# impractical to distribute across n features (as in 1-hot etc)
# no clear order determinable so a subset mean mechanism like James-Stein
# might work well - very tempting to consider encoding to two features
# mean and sd based - for any regression approach a mean^2 feature may help  
# looks like a roll-your-own (ryo) encoder - bah

# profession looks similar - with a few missing 

# (note from below - clearly artificial) 
# alphabetic application used - how does it correlate ?

# how about a James-Stein encoding in one feature and an additional 
# 1 hot encoding on the initial letter making 27 features (is this too many?) 

# degree - has missings and 5 levels - do missing simply constitute a 6th?

# hair - same

# check correlation - do missing actually make a feature ?
# one-hot would do this and treat missing separately.

# looking like about 50 wide at this point - too many dimensions
# really need to know how they matter.

# cpu and time are not an issue - just plug on 
# in real life I'd say go back and get good data
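
The two-feature (mean and sd) encoding considered above for the high-cardinality columns could be sketched as follows. This is not the James-Stein estimator itself but a simpler shrinkage of each category mean toward the overall mean; the toy data and the smoothing weight `m` are assumptions for illustration.

```python
import pandas as pd

# Toy data (assumed): a high-cardinality categorical and a numeric target
df = pd.DataFrame({"Country": ["A", "A", "A", "B", "B", "C"],
                   "Income": [10.0, 20.0, 30.0, 40.0, 60.0, 100.0]})

global_mean = df["Income"].mean()
stats = df.groupby("Country")["Income"].agg(["count", "mean", "std"])

# Shrink each category mean toward the global mean; m is a hypothetical
# smoothing weight, so small categories are pulled more strongly
m = 2.0
stats["enc_mean"] = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Singletons have no sample sd; fall back to the overall sd
stats["enc_sd"] = stats["std"].fillna(df["Income"].std())

# Two numeric features replace the single categorical column
df["Country_mean"] = df["Country"].map(stats["enc_mean"])
df["Country_sd"] = df["Country"].map(stats["enc_sd"])
print(df)
```

Because the encoding is built from the target, it would normally be fitted on training rows only, otherwise the validation score is likely to be optimistic.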


In [0]:
# and the unknowns
incomeunknown_df[['Gender','Country','Profession','Degree','Hair']].describe()
# similar Gender - but confirm the '3' match 
# and similar for Degree and Hair - critical to confirm the category match 

# Country and Profession - note another 'small' country and also the 'p' profession

# fewer countries and professions - but are there any in unknown that don't 
# exist in known - if so need to consider how to manage them for the encoding strategy
# sort of crappy problem - not the same as missing - not sure what to do
# except hope the answer is 'no' !





In [0]:
# look at the categorical data 
print ( incomeknown_df['Profession'].value_counts() )

print (len ( incomeknown_df.loc[incomeknown_df.Profession == 'other'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Profession == 'Other'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Profession == 'Unknown'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Profession == 'unknown'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Profession == '0'] ) )

print (len ( incomeknown_df.loc[incomeknown_df.Profession.isna()] ) )

In [0]:
print ( incomeknown_df['Country'].value_counts() )


print (len ( incomeknown_df.loc[incomeknown_df.Country == 'other'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Country == 'Other'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Country == 'Unknown'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Country == 'unknown'] ) )
print (len ( incomeknown_df.loc[incomeknown_df.Country == '0'] ) )

print (len ( incomeknown_df.loc[incomeknown_df.Country.isna()] ) )
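
The placeholder checks above can also be done in one pass with a case-insensitive match. A small sketch, assuming a toy column `col` and a hypothetical set of placeholder strings:

```python
import pandas as pd

# Toy column standing in for Profession or Country (assumed data)
col = pd.Series(["plumber", "Other", "unknown", None, "0", "other", "baker"])

# Hypothetical set of placeholder strings, compared in lower case
placeholders = {"other", "unknown", "0"}

# Missing values stay missing after str.lower() and never match
is_placeholder = col.str.lower().isin(placeholders)
print(col[is_placeholder].value_counts())
print("missing:", col.isna().sum())
```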

In [0]:


print ( incomeknown_df['Degree'].value_counts() )

print ( incomeknown_df['Hair'].value_counts() )


print ( incomeknown_df['Gender'].value_counts() )

# note the highly artificial nature of the data 
# professions starting with p dominate the high frequencies
# along with a few q and r
# whereas those beginning with a, b or c dominate the lower frequencies.
# James Stein encoding - or a variant of it based on target mean may work well for this.

# if there is something silly going on such as the real feature being the initial letter then 
# this will still work - need to be careful with the singletons.

# countries have no apparent pattern - treat as above.
# would be interesting to look at a mean and sd pair of features
# do we need to amend with mean and sd of entire known sample?
# seems like that is weakening the feature - depends on its correlation with the result.

# comforting that Laos (see above) is high here - looks like semi-randomness

# Real put-up job - 'analyse this, sucker!'

# sort of fishing in the dark here - the number of categories are very high.
# grouping them may be a better approach - but don't like that idea very much.


In [0]:
# and the unknowns 

print ( incomeunknown_df['Profession'].value_counts() )

print ( incomeunknown_df['Degree'].value_counts() )

print ( incomeunknown_df['Hair'].value_counts() )

print ( incomeunknown_df['Country'].value_counts() )

print ( incomeunknown_df['Gender'].value_counts() )

# really critical to know if there are any categorical values in unknowns that 
# are not present in knowns - and work out how to manage them.

# first need to work out if they exist - how ?

# - see France - doesn't look good - could well be categories unique to unknowns.

# Profession - the 'p' and (o) effect continues
# as does the abc at the bottom - and a '.' sneaked in !

# Degree and Hair are similar but both exhibit a switch in frequency order.
# In both cases the counts are close - but doesn't give
# confidence in the train test selection.

# real world - you can't always sample from the full population 
# but you need to know if you haven't.

# overall this is sort of crappy data - and I suspect that there 
# has been noise added and (see below on non-male/female) clearly some
# explicitly normal data inserted.
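
The question raised above, whether any categorical values appear in the unknowns but not in the knowns, can be answered with a set difference. A minimal sketch on toy columns (the country values are made up for illustration):

```python
import pandas as pd

# Toy columns standing in for the Country column of each set (assumed data)
known = pd.Series(["Laos", "France", "Peru", "Laos"], name="Country")
unknown = pd.Series(["France", "Chad", "Peru", "Chad", "Fiji"], name="Country")

# Categories present in the unknowns but never seen in the knowns
unseen = set(unknown.dropna().unique()) - set(known.dropna().unique())
print(sorted(unseen))

# How many unknown rows are affected
n_rows = int(unknown.isin(unseen).sum())
print(n_rows)
```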


In [0]:
# prepare some simple histograms and scatter plots of numeric data against income
# looking for any obvious correlation

# exclude outliers (particularly high incomes) to make the plots more useful

# 

# incomeknown_df.hist("Year", bins=40)
# incomeknown_df.loc[incomeknown_df.Income < 1000000].plot.scatter(x = "Year", y = "Income")

# print ("Scatter plot of Income by Year")
income_lt_60k = incomeknown_df.loc[incomeknown_df.Income < 60000]
plt.scatter(income_lt_60k["Year"], income_lt_60k["Income"], alpha=0.0075)
plt.title('Scatter plot of Income by Year')

incomeknown_df.loc[incomeknown_df.Income < 500000].hist("Income", bins=300)

# Age

incomeknown_df.hist("Age", bins=60)
incomeknown_df.loc[incomeknown_df.Income < 100000].plot.scatter(x = "Age", y = "Income", alpha=0.05)

# Height

incomeknown_df.hist("Height", bins=200)
incomeknown_df.loc[incomeknown_df.Income < 100000].plot.scatter(x = "Height", y = "Income", alpha=0.005)

# height and Gender

incomeknown_df.loc[(incomeknown_df.Income < 100000) & 
                  (incomeknown_df.Gender == "male")].hist("Height", bins=300)

incomeknown_df.loc[(incomeknown_df.Income < 100000) & 
                   (incomeknown_df.Gender == "male")].plot.scatter(x = "Height", y = "Income", c="LightBlue", alpha=0.01)

incomeknown_df.loc[(incomeknown_df.Income < 100000) & 
                   (incomeknown_df.Gender == "female")].hist("Height", bins=300)

incomeknown_df.loc[(incomeknown_df.Income < 100000) & 
                   (incomeknown_df.Gender == "female")].plot.scatter(x = "Height", y = "Income", c = 'Pink', alpha=0.01)

incomeknown_df.loc[(incomeknown_df.Income < 100000) & 
                   ((incomeknown_df.Gender != "female") & (incomeknown_df.Gender != "male"))].hist("Height", bins=300)

incomeknown_df.loc[(incomeknown_df.Income < 100000) & 
                   ((incomeknown_df.Gender != "female") & (incomeknown_df.Gender != "male"))].plot.scatter(x = "Height", y = "Income", c = 'Grey', alpha=0.02)

# CitySize
incomeknown_df.hist("CitySize", bins=60)
incomeknown_df.loc[(incomeknown_df.Income < 200000) & (incomeknown_df.CitySize < 500000)].plot.scatter(x = "CitySize", y = "Income", alpha=0.01)

 

# found a correlation on year linear or slightly x^2 - add a feature year ^2 ?

# income - 'skewed-normal' dist

# height shows an artificial factor adding to a normal dist - off-centre

# female and male height look normal 
# non gender height is artificial and even distribution between bounds.
# not sure what this means yet.

# all income distributions over height appear normal 
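
The squared Year feature suggested in the notes above could be added as in this sketch. Centring the year before squaring is an assumption made here, not something the notebook does; the toy frame `df` and the column name `Year2` are likewise illustrative.

```python
import pandas as pd

# Toy frame (assumed data)
df = pd.DataFrame({"Year": [1980.0, 1990.0, 2000.0, 2010.0]})

# Centre before squaring: over a narrow range of years the raw Year**2
# is almost perfectly collinear with Year itself
year_centred = df["Year"] - df["Year"].mean()
df["Year2"] = year_centred ** 2
print(df)
```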


In [0]:
# examine CitySize - odd break around 100K

print(len(incomeknown_df.loc[(incomeknown_df.CitySize < 100000)]),
      incomeknown_df.loc[(incomeknown_df.CitySize < 100000)].Income.mean())


print(len(incomeknown_df.loc[(incomeknown_df.CitySize == 100000)]),
      incomeknown_df.loc[(incomeknown_df.CitySize == 100000)].Income.mean())

print(len(incomeknown_df.loc[(incomeknown_df.CitySize > 100000)]),
      incomeknown_df.loc[(incomeknown_df.CitySize > 100000)].Income.mean())

# interesting - let's add a feature to pick this up: simply 1 for CitySize > 100K, 0 otherwise
# will help a regression grab this difference
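
A sketch of that indicator feature on a toy frame; the column name `BigCity` is a hypothetical choice, and the threshold mirrors the 100000 used in the comparisons above.

```python
import pandas as pd

# Toy frame (assumed data)
df = pd.DataFrame({"CitySize": [50_000, 100_000, 250_000, 1_200_000]})

# 1 for cities above the threshold, 0 otherwise
df["BigCity"] = (df["CitySize"] > 100_000).astype(int)
print(df)
```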

In [0]:
# do the same for unknowns - (without the correlation with income !) 


incomeunknown_df.hist("Height", bins=200)

# Height - same pattern of artificial even distribution added to normal

incomeunknown_df.loc[(incomeunknown_df.Gender == "male")].hist("Height", bins=300)

incomeunknown_df.loc[(incomeunknown_df.Gender == "female")].hist("Height", bins=300)

incomeunknown_df.loc[((incomeunknown_df.Gender != "female") & (incomeunknown_df.Gender != "male"))].hist("Height", bins=300)










In [0]:
# Look at some outliers 

# In colab print() clips wide data - direct output produces better output 
# - but only 1 can be used per block
# print(incomeknown_df.loc[incomeknown_df.Age > 110])
# print(incomeknown_df.loc[incomeknown_df.Income > 2500000])

incomeknown_df.loc[incomeknown_df.Age > 108]
# incomeknown_df.loc[incomeknown_df.Income > 2500000]

In [0]:
incomeknown_df.loc[incomeknown_df.Income > 2250000]

# note the lower case professions in the p q r s t u range 
# and upper case A



In [0]:
incomeknown_df.loc[incomeknown_df.Income < -3000]
# note no p q r s t u  amongst lowest (negative) incomes - despite being the most frequent 
# so there is a clear effect here

In [0]:
# OK time to try and start fixing data and make a working set 
# suitable for feature engineering 

# Lets look at gender

# Need some charts of Gender correlations for the non male / female and missing 
# can we just call these all a third group 'other' or do we need more

print ( len( incomeknown_df[['Gender']] ) )
print ( incomeknown_df[['Gender']].describe() )
print ("\n")
print ( incomeknown_df['Gender'].value_counts() )

print ("\n")


# lets look at mean and sd for the non male / female and see how many distinct 
# categories exist.
print ("gender, count, mean, st dev")

gencount = 0
for gendr in ["male", "female", "other", "unknown", "0"] :
  incomes = incomeknown_df.loc[incomeknown_df.Gender == gendr].Income
  print ( gendr, incomes.count(), incomes.mean(), incomes.std() )
  gencount += incomes.count()

# missing values never match with == so they are selected with isnull()
incomes = incomeknown_df.loc[incomeknown_df.Gender.isnull()].Income
print ( ".isnull()", incomes.count(), incomes.mean(), incomes.std() )
gencount += incomes.count()

print (gencount)
print ("\n")

# OK got 'em all

# just confirm we can get them all in unknowns - obviously without the 
# income data


print ( len( incomeunknown_df[['Gender']] ) )
print ( incomeunknown_df[['Gender']].describe() )
print ("\n")
print ( incomeunknown_df['Gender'].value_counts() )
print ("\n")
print ("gender, count")

gencount = 0
for gendr in ["male", "female", "other", "unknown", "0"] :
  instances = incomeunknown_df.loc[incomeunknown_df.Gender == gendr].Instance
  print ( gendr, instances.count() )
  gencount += instances.count()

# missing values never match with == so they are selected with isnull()
instances = incomeunknown_df.loc[incomeunknown_df.Gender.isnull()].Instance
print ( ".isnull()", instances.count() )
gencount += instances.count()

print (gencount)
print ("\n")
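
The per-category count, mean and standard deviation printed above can also be produced in one step with a grouped aggregation. A sketch on a toy frame (the data is made up); `dropna=False` keeps the missing values as their own group so the counts still add up to the number of rows.

```python
import pandas as pd

# Toy frame (assumed data) with two missing Gender values
df = pd.DataFrame({"Gender": ["male", "female", "male", None, "0", None],
                   "Income": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})

# One row per category, including the missing group
summary = df.groupby("Gender", dropna=False)["Income"].agg(["count", "mean", "std"])
print(summary)

# The counts should add up to the number of rows
total = int(summary["count"].sum())
print(total)
```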


In [0]:
# OK fixing Gender : 

# other and unknown look similar - 0 and null not quite so 
# one option is to merge them: other (adding in unknown) and none (for 0 and null),
# giving 4 categories
#
# the option used here is to keep them all, as they don't overlap well:
# other, unknown, none1 (was "0"), none2 (was null)

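# assign the result back: inplace=True on a column selected from the frame
# may not update the frame in recent pandas versions
incomeknown_df["Gender"] = incomeknown_df["Gender"].replace("0", "none1")
incomeknown_df["Gender"] = incomeknown_df["Gender"].fillna("none2")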
print ( incomeknown_df[['Gender']].describe() )
print ("\n")
print ( incomeknown_df['Gender'].value_counts() )
print ("\n")
incomeunknown_df["Gender"] = incomeunknown_df["Gender"].replace("0", "none1")
incomeunknown_df["Gender"] = incomeunknown_df["Gender"].fillna("none2")
print ( incomeunknown_df[['Gender']].describe() )
print ("\n")
print ( incomeunknown_df['Gender'].value_counts() )

In [0]:
# do the same thing with Hair 

print ( len( incomeknown_df[['Hair']] ) )
print ( incomeknown_df[['Hair']].describe() )
print ("\n")
print ( incomeknown_df['Hair'].value_counts() )

print ("\n")


# lets look at mean and sd for each and see how many distinct 
# categories exist.
haircount = 0
for haircol in ["Black", "Blond", "Brown", "Red", "Unknown", "0"] :
  incomes = incomeknown_df.loc[incomeknown_df.Hair == haircol].Income
  print ( haircol, incomes.count(), incomes.mean(), incomes.std() )
  haircount += incomes.count()

# missing values never match with == so they are selected with isnull()
incomes = incomeknown_df.loc[incomeknown_df.Hair.isnull()].Income
print ( ".isnull()", incomes.count(), incomes.mean(), incomes.std() )
haircount += incomes.count()

print (haircount)
print ("\n")

# OK got 'em all

# just confirm we can get them all in unknowns - obviously without the 
# income data


print ( len( incomeunknown_df[['Hair']] ) )
print ( incomeunknown_df[['Hair']].describe() )
print ("\n")
print ( incomeunknown_df['Hair'].value_counts() )

haircount = 0
for haircol in ["Black", "Blond", "Brown", "Red", "Unknown", "0"] :
  instances = incomeunknown_df.loc[incomeunknown_df.Hair == haircol].Instance
  print ( haircol, instances.count() )
  haircount += instances.count()

# missing values never match with == so they are selected with isnull()
instances = incomeunknown_df.loc[incomeunknown_df.Hair.isnull()].Instance
print ( ".isnull()", instances.count() )
haircount += instances.count()

print (haircount)
print ("\n")



In [0]:
# OK fixing Hair : 

# null looks like Black - 
# could merge some of these for simplicity,
# but keep them all - better, they don't overlap well
# Unknown, Unknown1 (was "0"), Unknown2 (was null)

incomeknown_df["Hair"] = incomeknown_df["Hair"].replace("0", "Unknown1")
incomeknown_df["Hair"] = incomeknown_df["Hair"].fillna("Unknown2")
print ( incomeknown_df[['Hair']].describe() )
print ("\n")
print ( incomeknown_df['Hair'].value_counts() )
print ("\n")
incomeunknown_df["Hair"] = incomeunknown_df["Hair"].replace("0", "Unknown1")
incomeunknown_df["Hair"] = incomeunknown_df["Hair"].fillna("Unknown2")
print ( incomeunknown_df[['Hair']].describe() )
print ("\n")
print ( incomeunknown_df['Hair'].value_counts() )

In [0]:
# and with Degree


print ( len( incomeknown_df[['Degree']] ) )
print ( incomeknown_df[['Degree']].describe() )
print ("\n")
print ( incomeknown_df['Degree'].value_counts() )

print ("\n")


# lets look at mean and sd for each and see how many distinct 
# categories exist.
degcount = 0
for degre in ["Bachelor", "Master", "PhD", "No", "0"] :
  incomes = incomeknown_df.loc[incomeknown_df.Degree == degre].Income
  print ( degre, incomes.count(), incomes.mean(), incomes.std() )
  degcount += incomes.count()

# missing values never match with == so they are selected with isnull()
incomes = incomeknown_df.loc[incomeknown_df.Degree.isnull()].Income
print ( ".isnull()", incomes.count(), incomes.mean(), incomes.std() )
degcount += incomes.count()

print (degcount)
print ("\n")

# OK got 'em all

# just confirm we can get them all in unknowns - obviously without the 
# income data


print ( len( incomeunknown_df[['Degree']] ) )
print ( incomeunknown_df[['Degree']].describe() )
print ("\n")
print ( incomeunknown_df['Degree'].value_counts() )

degcount = 0
for degre in ["Bachelor", "Master", "PhD", "No", "0"] :
  instances = incomeunknown_df.loc[incomeunknown_df.Degree == degre].Instance
  print ( degre, instances.count() )
  degcount += instances.count()

# missing values never match with == so they are selected with isnull()
instances = incomeunknown_df.loc[incomeunknown_df.Degree.isnull()].Instance
print ( ".isnull()", instances.count() )
degcount += instances.count()

print (degcount)
print ("\n")




In [0]:
# OK Fixing Degree

# The different non-degrees look quite different - 
# again we keep them all
# unknown1, unknown2

incomeknown_df["Degree"] = incomeknown_df["Degree"].replace("0", "Unknown1")
incomeknown_df["Degree"] = incomeknown_df["Degree"].fillna("Unknown2")
print ( incomeknown_df[['Degree']].describe() )
print ("\n")
print ( incomeknown_df['Degree'].value_counts() )
print ("\n")
incomeunknown_df["Degree"] = incomeunknown_df["Degree"].replace("0", "Unknown1")
incomeunknown_df["Degree"] = incomeunknown_df["Degree"].fillna("Unknown2")
print ( incomeunknown_df[['Degree']].describe() )
print ("\n")
print ( incomeunknown_df['Degree'].value_counts() )

In [0]:
# with professions we just need to fill the nulls
# since we have so many adding an extra one is no harm
# but use an upper case initial letter as there is something going on with 
# the frequency of initial letters in profession

# do any begin with "U" ?

# below only works after nulls are fixed

# for prof in incomeknown_df.Profession.unique() :
#   if prof[0] =="U" :
#     print (prof)

# No 

# do any begin with uppercase ?

# for prof in incomeknown_df.Profession.unique() :
#  if prof[0] <= "Z" :
#    print (prof)
    
# YES mostly A B & C (one O and 1 H)
# also note the "."


print ( len( incomeknown_df[['Profession']] ) )
print ( incomeknown_df[['Profession']].describe() )

print ( len( incomeunknown_df[['Profession']] ) )
print ( incomeunknown_df[['Profession']].describe() )

# and identify any in the unknowns which are not in the knowns.
# need to code this .....

# incomeunknown_df.Profession.replace("Unknown2","Unknown", inplace=True)

for prof in incomeunknown_df.Profession.unique() :
  # print ( prof )
  # print ( type(prof) ) 
  
  if len(incomeknown_df.loc[(incomeknown_df.Profession == prof)]) == 0 :
    print ( prof , 
           len(incomeknown_df.loc[(incomeknown_df.Profession == prof)]) , 
           len(incomeunknown_df.loc[(incomeunknown_df.Profession == prof)]) )

# there are 11 such professions, covering 21 samples 
# not surprisingly all beginning with a, b or c

# what to do here ? - need the initial letter correlation with income to 
# help make a decision.

# if mean based then use the overall mean - but if using one-hot on the initial
# letter then these need to be substituted to make it easier.
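
The check for professions that appear only in the unknown set can also be written without a Python loop. The sketch below is an illustration only: `known_df` and `unknown_df` are small hypothetical stand-ins for `incomeknown_df` and `incomeunknown_df`, and the profession names are made up.

```python
import pandas as pd

# Hypothetical stand-ins for incomeknown_df / incomeunknown_df
known_df = pd.DataFrame({"Profession": ["baker", "clerk", "baker", "driver"]})
unknown_df = pd.DataFrame({"Profession": ["baker", "actuary", "actuary", "chemist"]})

# True for rows of the unknown set whose profession never occurs in the known set
unseen_mask = ~unknown_df["Profession"].isin(known_df["Profession"])

# One entry per unseen profession, with the number of samples it covers
unseen_counts = unknown_df.loc[unseen_mask, "Profession"].value_counts()

print(unseen_counts)
print(len(unseen_counts), "professions over", int(unseen_mask.sum()), "samples")
```

The same two lines (`isin` plus `value_counts`) should work for any categorical column, so the check does not need to be rewritten for each one.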



In [0]:
# fixing professions :

incomeknown_df['Profession'] = incomeknown_df['Profession'].fillna("Unknown")
incomeunknown_df['Profession'] = incomeunknown_df['Profession'].fillna("Unknown")

In [0]:
# no need to fix countries all present 
# but is there a country in the unknowns that is not in the knowns

# print ( incomeknown_df['Country'].value_counts() )
# type( incomeknown_df['Country'].value_counts() )
# print ( "in known" )
# print(incomeknown_df[incomeknown_df.Country == "Malawi"])
# print ( "in unknown" )
# print(incomeunknown_df[incomeunknown_df.Country == "Malawi"])

for ctry in incomeunknown_df.Country.unique() :
  # print ( ctry )
  # print ( type(ctry) ) 
  
  if len(incomeknown_df.loc[(incomeknown_df.Country == ctry)]) == 0 :
    print ( ctry , 
           len(incomeknown_df.loc[(incomeknown_df.Country == ctry)]) , 
           len(incomeunknown_df.loc[(incomeunknown_df.Country == ctry)]) )
    
# there are 6 of these covering 10 samples = we need to do something with them 
# very tempting to just dump them into another 'middle-ish country'
# but better to look ahead to the encoding - if we use a mean based encoding 
# then these can map to the overall mean.
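
A minimal sketch of the mean based encoding mentioned above, assuming the encoding is simply the mean income per country taken from the known set. The frames, country labels and the `Country_enc` column name are hypothetical. Countries that only occur in the unknown set have no mean of their own, so they fall back to the overall mean.

```python
import pandas as pd

# Hypothetical stand-ins for the known and unknown data
known_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "B"],
    "Income": [10.0, 30.0, 40.0, 50.0, 60.0],
})
unknown_df = pd.DataFrame({"Country": ["A", "B", "C"]})  # "C" is not in the knowns

# Mean income overall and per country, from the known set only
overall_mean = known_df["Income"].mean()
country_means = known_df.groupby("Country")["Income"].mean()

# Map each country to its mean; unseen countries become NaN and then
# take the overall mean
known_df["Country_enc"] = known_df["Country"].map(country_means)
unknown_df["Country_enc"] = unknown_df["Country"].map(country_means).fillna(overall_mean)

print(unknown_df)
```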
  

In [0]:
# fixing Age and Year

# just using mean for both 
# need to get past exploration and on to feature engineering now

incomeknown_df['Age'] = incomeknown_df['Age'].fillna(37.0)
incomeunknown_df['Age'] = incomeunknown_df['Age'].fillna(37.0)

incomeknown_df['Year'] = incomeknown_df['Year'].fillna(1999.0)
incomeunknown_df['Year'] = incomeunknown_df['Year'].fillna(1999.0)

# Ok so here we have complete data - and we know a lot about it 
# but we haven't engineered any features yet
# and we haven't done any encoding
# save and implement some feature engineering
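
The fill values above are typed in as constants. A hedged alternative is to compute them from the known set and reuse the same values for the unknown set, so both sets are filled consistently. The frames below are hypothetical stand-ins, and rounding the mean to a whole number is an assumption, not something the original states.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins with some missing Age / Year values
known_df = pd.DataFrame({"Age": [30.0, np.nan, 44.0],
                         "Year": [1990.0, 2000.0, np.nan]})
unknown_df = pd.DataFrame({"Age": [np.nan, 50.0],
                           "Year": [np.nan, 2010.0]})

# Column means from the known set only, rounded to whole numbers (assumption)
fill_values = known_df[["Age", "Year"]].mean().round()

# DataFrame.fillna with a Series fills each column with its matching entry
known_df = known_df.fillna(fill_values)
unknown_df = unknown_df.fillna(fill_values)

print(fill_values)
print(unknown_df)
```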

In [0]:
# repeat the general outputs to verify counts etc.
incomeknown_df.describe()

In [0]:
incomeunknown_df.describe()

In [0]:
incomeknown_df[['Gender','Country','Profession','Degree','Hair']].describe()

In [0]:
incomeunknown_df[['Gender','Country','Profession','Degree','Hair']].describe()

In [0]:
# OK - time to do some fixing.

# don't like the outliers - let's cap a few 

# income outliers are going to mess up a lot of things
# but it's a bit scary changing the answer

# how to change a value based on its existing value ?

# going to replace all incomes above 2,000,000 with 2,000,000

# df.loc[<row selection>, <column selection>]

print ( len (incomeknown_df.loc[incomeknown_df.Income > 1999999] ))
print ( len (incomeknown_df.loc[incomeknown_df.Income > 2000000] ))
print("clipping outliers")

incomeknown_df.loc[incomeknown_df.Income > 2000000, 'Income'] = 2000000

print ( len (incomeknown_df.loc[incomeknown_df.Income > 1999999] ))
print ( len (incomeknown_df.loc[incomeknown_df.Income > 2000000] ))
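
The same capping can be sketched with `Series.clip`, which avoids writing the threshold twice. The frame `df`, its values and the name `cap` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical incomes, two of them above the cap
df = pd.DataFrame({"Income": [50_000, 2_500_000, 120_000, 5_000_000]})
cap = 2_000_000

n_above_before = int((df["Income"] > cap).sum())

# Values above the cap are replaced by the cap; everything else is untouched
df["Income"] = df["Income"].clip(upper=cap)

n_above_after = int((df["Income"] > cap).sum())
print(n_above_before, n_above_after)
```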

##_
_function definitions_

In [0]:
print("def predict(X, theta)")
def predict(X, theta):
  # takes m by n matrix X as input and returns an m by 1 vector 
  # containing the predictions h_theta(x^i) for each row x^i, i=1,...,m in X
  ##### replace the next line with your code #####
  
  # This is the generic multiple variable implementation. 
  
  # For convenience each data point in X has an initial 1 
  #   added to enable the B0 coefficient to be applied efficiently
  
  # print(X)
  # print(theta)
  # print(X.shape)
  
  # Get the number of coefficients B0, B1 ..... 
  #   we iterate over these
  n=len(theta)
  
  # For the multiple linear regression we need to iterate over the 
  #   length of theta 
  # This assumes a simple B0 + B1.X1 + B2.X2 + ....
  #   it does not consider more complex dependencies
  #   such as B0 + B1.X1 + B2.X2 + B3.X1.X2   etc.
  #   these require the X to be constructed to meet the requirement
  #   and may introduce too much collinearity.
  
  # Check for single data point
  #   requires a special case as it arrives as 1D, rather than 2D
  #   and if it does arrive as 2D the else handles it anyway
  
  # NumPy can multiply each row of an array by an equivalent sized
  #   vector directly (broadcasting, or np.dot(X, theta) for the whole sum)
  #   but here we use a for loop to iterate over the vector and use 
  #   numpy maths, which are more efficient, to apply each element
  #   of the theta vector to the equivalent data variable for all points. 
   
  if len(X.shape) == 1 : 
    # this is the edge case for a single data point in a vector
    pred = X[0] * theta[0] 
    for i in range(1,n) :
      pred = pred + X[i]*theta[i]  
        
  else : 
    #this is the general case for data point(s) in an array
    pred = X[:,0]*theta[0] 
    for i in range(1,n) :
      pred = pred + X[:,i]*theta[i]    
        
  # print(pred)
  
  # Return the prediction
  # The returned value is a vector with one value per data point.
  return pred
# end def predict


print("def computeCost(X, y, theta)")
def computeCost(X, y, theta):
  # function calculates the cost J(theta) and returns its value
  ##### replace the next line with your code #####
  
  # this function is already independent of the number of variables
  # provided the X array carries the data with a leading 1 on each data point
  # and the theta has the same number of B0, B1, B2 ...  values.
  
  # Get the value(s) predicted by the current theta values
  costpred = predict (X,theta)
  
  # get the difference(s) between the predictions and the actuals
  costbase = costpred - y
    
  # get the square of the difference(s)
  costsq = costbase **2
  
  # get the sum of the squared difference(s)
  sumcost = costsq.sum() 
  
  # divide the sum of squares by twice the number of data points
  cost = sumcost/(2*len(y))  
  
  # return the cost of the current theta values
  # the returned value is a numeric value of the cost.
  return cost
# end def computeCost



print("def computeGradient(X, y, theta)")
def computeGradient(X, y, theta):
  # function calculates the gradient of J(theta) and returns its value
  ##### replace the next line with your code #####
  # This function is already capable of dealing with multiple variables
  # provided X (with a leading 1 per data point) and theta are matched sizes.
  
  # number of coefficients
  n=len(theta)
  # initiate the result
  grad = np.zeros(n)
  
  # print ("Number of coefficients: " , n)
  # print ("Shapes of data, result and theta : " , X.shape, y.shape, theta.shape)
  # print ("Types of  data, result and theta : " , type(X), type(y), type(theta))
  # print ("Sample head of data")
  # for i in range(5) :
      # print (X[i,:])
    
  # Get the value predicted by the current theta values
  costpred = predict (X,theta)
  # print ("Compute Grad #2 :", costpred.shape)
  
  # get the difference between the predictions and the actuals
  costbase = costpred - y
  # print ("Compute Grad #3 :", costbase.shape)  
  
  # calculate the gradient for each coefficient
  # for is inefficient but it is used to iterate over a small
  # range (1 more than the number of variables 
  # i.e the number of coefficients)
  # while numpy array maths are used for the larger dimension
  # (the number of data points)
  for i in range(n) :
    # get product of the differences and the data column for this coefficient
    costprod = costbase * X[:,i]
    # print ("Compute Grad #4 :", costprod.shape) 
    # get the sum of the cost products
    costprodsum = costprod.sum()
    # print ("Compute Grad #5 :", costprodsum)
    # divide by the number of data points
    grad[i] = costprodsum / len(y)
    # take this outside the for loop and use numpy maths ! 
    # print ("Compute Grad #6 :", grad[i])
  # end for
  # print ("Compute Grad #6 :", grad)
  
  # return the gradient
  # the returned value is a vector with one gradient per coefficient (theta) 
  return grad
# end def computeGradient


print("def gradDescent(X, y, theta, iters, alpha)")
def gradDescent(X, y, theta, iters, alpha):
  # X - data (with the extra 1s in col 1)
  # y - results - vector with 1 result per data point
  # theta - B0, B1, B2 ... starting values 
  # iters - how many iterations
  # alpha - learning rate (size of adjustment per step)
  
  # returns : 
  # theta - B0, B1, B2 ... final values 
  # cost - cost after each iteration
  
  # initialize cost array
  cost = np.zeros(iters)
  
  for i in range(iters):
    theta = theta - alpha * computeGradient(X,y,theta)
    cost[i] = computeCost(X, y, theta)

  return theta, cost
# end def gradDescent   
  

print (datetime.now())
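
For comparison, here is a sketch of fully vectorised versions of the three functions above. The names `predict_vec`, `cost_vec` and `gradient_vec` are hypothetical and not part of the original notebook. They rely on `np.dot`, which handles a single 1D data point and a 2D array of points alike, so no special case or loop over coefficients is needed.

```python
import numpy as np

def predict_vec(X, theta):
    # B0*1 + B1*X1 + B2*X2 + ... for every row of X (or for a single 1D point)
    return np.dot(X, theta)

def cost_vec(X, y, theta):
    # sum of squared differences divided by twice the number of data points
    residual = predict_vec(X, theta) - y
    return np.dot(residual, residual) / (2 * len(y))

def gradient_vec(X, y, theta):
    # one gradient per coefficient: mean of (difference * data column)
    residual = predict_vec(X, theta) - y
    return np.dot(X.T, residual) / len(y)

# Same shape of test data as used in the testing section (leading column of 1s)
X = np.array([[1, 2, 3, 5],
              [1, 7, 11, 13],
              [1, 17, 19, 23],
              [1, 29, 31, 37],
              [1, 41, 43, 47]], dtype=float)
y = np.array([22, 61, 117, 191, 261], dtype=float)
theta = np.array([5, 3, 2, 1], dtype=float)

print(predict_vec(X, theta))
print(cost_vec(X, y, theta))
print(gradient_vec(X, y, theta))
```

Because this theta reproduces `y` exactly, the cost and every gradient component come out as zero.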

In [0]:

print("def gradientDescent(X, y, numparams)")
def gradientDescent(X, y, numparams):
  # iteratively update parameter vector theta
  # -- you should not modify this function

  # initialize variables for learning rate and iterations
  alpha = 0.02
  iters = 5000
  cost = np.zeros(iters)
  theta= np.zeros(numparams)

  for i in range(iters):
    theta = theta - alpha * computeGradient(X,y,theta)
    cost[i] = computeCost(X, y, theta)

  return theta, cost



print("def normaliseData(x)")
def normaliseData(x):
  # rescale data by dividing each column by its maximum
  # (values lie between 0 and 1 only if the data is non-negative)
  scale = x.max(axis=0)
  return (x/scale, scale)



print("def splitData(X, y)")
def splitData(X, y):
  # split data into training and test parts
  # ... for now, we use all of the data for training and testing
  Xtrain=X; ytrain=y; Xtest=X; ytest=y
  return (Xtrain, ytrain, Xtest, ytest)


print (datetime.now())
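
`splitData` currently returns all of the data for both training and testing. One possible way to make a real split is sketched below. The function name `split_data_random`, the 20% test fraction and the fixed seed are all assumptions for illustration. It returns the parts in the same order as `splitData`.

```python
import numpy as np

def split_data_random(X, y, test_fraction=0.2, seed=0):
    # shuffle the row indices, then cut them into a test part and a train part
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Hypothetical data: 10 points with 2 variables each
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

Xtrain, ytrain, Xtest, ytest = split_data_random(X, y)
print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)
```

With a split like this, the scale factors from `normaliseData` would be taken from the training part only and then applied to the test part.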

##_
_main_

In [0]:
print ("def main()")
def main():
  # load the data
  # "https://raw.githubusercontent.com/johnsl01/Titanic_Python/master/titanic_known.csv"
  data=np.loadtxt('https://raw.githubusercontent.com/johnsl01/linreg/master/stockprices.csv',usecols=(1,2))
  X=data[:,0]
  y=data[:,1]

  # plot the data so we can see how it looks
  # (output is in file graph.png)
  fig, ax = plt.subplots(figsize=(12, 8))
  ax.scatter(X, y, label='Data')
  ax.set_xlabel('Amazon')
  ax.set_ylabel('Google')
  ax.set_title('Google stock price vs Amazon')
  fig.savefig('graph.png')

  # split the data into training and test parts
  (Xtrain, ytrain, Xtest, ytest)=splitData(X,y)

  # add a column of ones to input data
  m=len(ytrain) # m is number of training data points
  Xtrain = np.column_stack((np.ones((m, 1)), Xtrain))
  (m,n)=Xtrain.shape # m is number of data points, n number of features

  # rescale training data to lie between 0 and 1
  (Xt,Xscale) = normaliseData(Xtrain)
  (yt,yscale) = normaliseData(ytrain)

  # calculate the prediction
  print('testing the prediction function ...')
  theta=(1,2)
  print('when x=[1,1] and theta is [1,2] prediction = ',predict(np.ones(n),theta))
  print('approx expected prediction is 3')
  print('when x=[[1,1],[5,5]] and theta is [1,2] prediction = ',predict(np.array([[1,1],[5,5]]),theta))
  print('approx expected prediction is [3,15]')
  input('Press Enter to continue...')

  # calculate the cost when theta is zero
  print('testing the cost function ...')
  theta=np.zeros(n)
  print('when theta is zero cost = ',computeCost(Xt,yt,theta))
  print('approx expected cost value is 0.318')
  input('Press Enter to continue...')

  # calculate the gradient when theta is zero
  print('testing the gradient function ...')
  print('when theta is zero gradient = ',computeGradient(Xt,yt,theta))
  print('approx expected gradient value is [-0.79,-0.59]')
  input('Press Enter to continue...')

  # perform gradient descent to "fit" the model parameters
  print('running gradient descent ...')
  theta, cost = gradientDescent(Xt, yt, n)
  print('after running gradientDescent() theta=',theta)
  print('approx expected value is [0.34, 0.61]')

  # plot some predictions
  Xpred = np.linspace(X.min(), X.max(), 100)
  Xpred = np.column_stack((np.ones((100, 1)), Xpred))
  ypred = predict(Xpred/Xscale, theta)*yscale
  fig, ax = plt.subplots(figsize=(12, 8))
  ax.scatter(Xtest, ytest, color='b', label='Test Data')
  ax.plot(Xpred[:,1], ypred, 'r', label='Prediction')
  ax.set_xlabel('Amazon')
  ax.set_ylabel('Google')
  ax.legend(loc=2)
  fig.savefig('pred.png')

  # and plot how the cost varies as the gradient descent proceeds
  fig2, ax2 = plt.subplots(figsize=(12, 8))
  ax2.semilogy(cost,'r')
  ax2.set_xlabel('iteration')
  ax2.set_ylabel('cost')
  fig2.savefig('cost.png')
  
  # plot the cost function
  fig3 = plt.figure()
  ax3 = fig3.add_subplot(1, 1, 1, projection='3d')
  grid=100 # grid points per axis (kept separate from n, the number of coefficients)
  theta0, theta1 = np.meshgrid(np.linspace(-3, 3, grid), np.linspace(-3, 2, grid))
  cost = np.empty((grid,grid))
  for i in range(grid):
    for j in range(grid):
      cost[i,j] = computeCost(Xt,yt,(theta0[i,j],theta1[i,j]))
  ax3.plot_surface(theta0,theta1,cost)
  ax3.set_xlabel('theta0')
  ax3.set_ylabel('theta1')
  ax3.set_zlabel('J(theta)')
  fig3.savefig('J.png')
  
# end def main()  
  
print (datetime.now())


In [0]:
print (datetime.now())

main()

print (datetime.now())


##_ 
_testing section_

In [0]:
# test section (predict)
# assume all defs are in place 
# but main hasn't run
print("Test of multi variable prediction")
print("Three variables - five data points - with B0")
test_array = np.array([[1,2,3,5],
                       [1,7,11,13],
                       [1,17,19,23],
                       [1,29,31,37],
                       [1,41,43,47]])
print ("shape of test data : " , test_array.shape )
test_theta = np.array([1,2,3,5])
print ("shape of test theta : ", test_theta.shape )

test_predict = predict (test_array, test_theta)
print(test_predict)



In [0]:
print ("Test of multi variable cost")
print ("cost when theta is all zeros and when theta is correct")
test_array = np.array([[1,2,3,5],
                       [1,7,11,13],
                       [1,17,19,23],
                       [1,29,31,37],
                       [1,41,43,47]])
print ("shape of test data : " , test_array.shape )
test_theta = np.array([0,0,0,0])
print ("shape of test theta : ", test_theta.shape )
test_results = np.array([22,61,117,191,261])
print ("shape of test results : ", test_results.shape )
good_theta = np.array([5,3,2,1])
print ("shape of correct theta : ", good_theta.shape )

test_cost =  computeCost(test_array, test_results, test_theta)
print ("cost at theta all zeros : " , test_cost)

good_cost =  computeCost(test_array, test_results, good_theta)
print ("cost at correct theta : " , good_cost)

In [0]:
print ("Test of multi variable gradient")
print ("At zero theta and correct theta")
test_array = np.array([[1,2,3,5],
                       [1,7,11,13],
                       [1,17,19,23],
                       [1,29,31,37],
                       [1,41,43,47]])
print ("shape of test data : " , test_array.shape )
test_theta = np.array([0,0,0,0])
print ("shape of test theta : ", test_theta.shape )
test_results = np.array([22,61,117,191,261])
print ("shape of test results : ", test_results.shape )
good_theta = np.array([5,3,2,1])

test_grad =  computeGradient(test_array, test_results, test_theta)
print ("gradient at theta all zeros : " , test_grad)

good_grad =  computeGradient(test_array, test_results, good_theta)
print ("gradient at theta correct : " , good_grad)
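
A common extra check on a gradient function is to compare it with a finite-difference estimate of the cost. The sketch below is self-contained, so it defines its own `cost` and `analytic_gradient` (hypothetical names using the same formulas as `computeCost` and `computeGradient`) rather than calling the notebook functions. The step size `eps` is an arbitrary choice.

```python
import numpy as np

def cost(X, y, theta):
    # sum of squared differences over twice the number of data points
    residual = X @ theta - y
    return (residual ** 2).sum() / (2 * len(y))

def analytic_gradient(X, y, theta):
    # mean of (difference * data column) for each coefficient
    return X.T @ (X @ theta - y) / len(y)

def numeric_gradient(X, y, theta, eps=1e-4):
    # central difference: nudge one coefficient at a time up and down
    grad = np.zeros(len(theta))
    for i in range(len(theta)):
        step = np.zeros(len(theta))
        step[i] = eps
        grad[i] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * eps)
    return grad

X = np.array([[1, 2, 3, 5],
              [1, 7, 11, 13],
              [1, 17, 19, 23],
              [1, 29, 31, 37],
              [1, 41, 43, 47]], dtype=float)
y = np.array([22, 61, 117, 191, 261], dtype=float)

theta_zero = np.zeros(4)
print("analytic :", analytic_gradient(X, y, theta_zero))
print("numeric  :", numeric_gradient(X, y, theta_zero))
```

If the two rows differ by more than rounding error, the gradient code and the cost code disagree with each other.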

In [0]:
# full implementation with multi variable test data 
X = np.array([[2,3,5],
              [7,11,13],
              [17,19,23],
              [29,31,37],
              [41,43,47]])
print ("shape of test data : " , X.shape )
print ("test data : \n ", X)

# perfectly linear results
# y = np.array([22,61,117,191,261])

# Or with some non linearity 
y = np.array([23,60,118,193,260])

print ("shape of test results : ", y.shape )
print ("results : " , y )

good_theta = np.array([5,3,2,1])

# Starting Position 
# note : without col of 1s in data (they get added below)


# split the data into training and test parts
#   note : current split is 100% train
(Xtrain, ytrain, Xtest, ytest)=splitData(X,y)

# add a column of ones to input data
m=len(ytrain) # m is number of training data points
# note this is a flaw in the original code as it was len(y) which only works if
# the train/test split is 100% - so it needs to be len (ytrain) to avoid confusion
Xtrain = np.column_stack((np.ones((m)), Xtrain))

print ("Shape of training data : " , Xtrain.shape)
print ("Training Data : \n " , Xtrain)
 
(m,n)=Xtrain.shape # m is number of data points, n number of features

# rescale training data to lie between 0 and 1
(Xt,Xscale) = normaliseData(Xtrain)
(yt,yscale) = normaliseData(ytrain)

print ("Shape of training data : " , Xt.shape)
print ("Training Data : \n " , Xt)

# perform gradient descent to "fit" the model parameters
# print('running gradient descent ...')
# theta, cost = gradientDescent(Xt, yt, n)
# print('after running gradientDescent() theta=',theta , "\n at cost : " , cost)

theta_init = np.zeros(n)
iters = 1000000
alpha = 0.1

initial_cost = computeCost ( Xt, yt, theta_init)
print ("Initial theta : " , theta_init )
print ("Initial cost = ", initial_cost)


print('running gradient descent ...')
theta_new, cost = gradDescent(Xt, yt, theta_init, iters, alpha)
# print('after running gradientDescent() theta=',theta , "\n at cost : " , cost)

final_cost = cost[-1]

print ("final theta : " , theta_new )
print ("final cost = ", final_cost)


# and plot how the cost varies as the gradient descent proceeds
fig2, ax2 = plt.subplots(figsize=(12, 8))
ax2.semilogy(cost,'r')
ax2.set_xlabel('iteration')
ax2.set_ylabel('cost')
# fig2.savefig('cost.png')

theta_init = theta_new
iters = 5
alpha = 0.1

initial_cost = computeCost ( Xt, yt, theta_init)
print ("Initial theta : " , theta_init )
print ("Initial cost = ", initial_cost)


print('running gradient descent ...')
theta_new, cost = gradDescent(Xt, yt, theta_init, iters, alpha)
# print('after running gradientDescent() theta=',theta , "\n at cost : " , cost)

final_cost = cost[-1]

print ("final theta : " , theta_new )
print ("final cost = ", final_cost)


# and plot how the cost varies as the gradient descent proceeds
fig3, ax3 = plt.subplots(figsize=(20, 8))
ax3.semilogy(cost,'r')
ax3.set_xlabel('iteration')
ax3.set_ylabel('cost')
# fig2.savefig('cost.png')
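
The theta found above applies to the rescaled data, so it cannot be compared directly with `good_theta`. Since the prediction is `predict(X/Xscale, theta)*yscale`, the coefficients for the original units are `theta * yscale / Xscale`. The sketch below illustrates this. To stay short it uses `np.linalg.lstsq` as a stand-in for the gradient descent result, which is an assumption. The conversion itself does not depend on how theta was obtained.

```python
import numpy as np

X = np.array([[2, 3, 5],
              [7, 11, 13],
              [17, 19, 23],
              [29, 31, 37],
              [41, 43, 47]], dtype=float)
y = np.array([23, 60, 118, 193, 260], dtype=float)

# add the column of ones, then rescale in the same way as normaliseData
X1 = np.column_stack((np.ones(len(y)), X))
Xscale = X1.max(axis=0)
yscale = y.max()
Xt, yt = X1 / Xscale, y / yscale

# stand-in for the gradient descent result: a least-squares fit on scaled data
theta_scaled, *_ = np.linalg.lstsq(Xt, yt, rcond=None)

# convert the coefficients back to the original units
theta_original = theta_scaled * yscale / Xscale

# both routes give the same predictions
pred_scaled_route = (X1 / Xscale) @ theta_scaled * yscale
pred_direct_route = X1 @ theta_original
print(pred_scaled_route)
print(pred_direct_route)
```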

In [0]:
print (test_array, "\n")
print (test_array * 2, "\n")

In [0]:
print (test_array + test_array, "\n")
print (test_array[0,:], "\n")
print (test_array[1,:], "\n")
print (test_array[2,:], "\n")
print (test_array[:,0], "\n")
print (test_array[:,1], "\n")

In [0]:
predict_1 = np.zeros((test_array.shape[0],test_array.shape[1]))
print (predict_1.shape, "\n")
print (predict_1, "\n")

print (predict_1[0,:], "\n")
print (predict_1[1,:], "\n")
print (predict_1[2,:], "\n")
print (predict_1[:,0], "\n")
print (predict_1[:,1], "\n")

In [0]:
testnum = "Case #1"
myX = np.array([1,1])
mytheta = (1,2)

In [0]:
testnum = "Case #2"
myX = np.array([[1,1],[5,5]])
mytheta = (1,2)

In [0]:
print (testnum)
print (datetime.now(), "\n")
print ("Shape of myX : ", myX.shape, "\n")
print ("Type of myX.shape : " , type (myX.shape), "\n")
print ("Len of myX.shape : ", len(myX.shape), "\n")
if len(myX.shape) == 1 : 
  print ("myX col 0 : ", myX[0], "\n")
  print ("myX col 1 : ", myX[1], "\n")
else : 
  print ("myX col 0 : ", myX[:,0], "\n")
  print ("myX col 1 : ", myX[:,1], "\n")
print (mytheta[0], "\n")
print (mytheta[1], "\n", "\n")
mypredict_1 = predict(myX,mytheta)
print (mypredict_1.shape, "\n")
print (mypredict_1, "\n", "\n")
print (datetime.now(), "\n")

In [0]:
# multi dimensional test example
# need two independent variables 
# will produce a planar regression.
# need some data
# use the existing amazon and google data 
# but generate a new result variable
# based on an integer multiple of each of the variables 
# with a random factor thrown into each plus an additional 
# random factor (the random factors are not added below yet).
data=np.loadtxt('https://raw.githubusercontent.com/johnsl01/linreg/master/stockprices.csv',usecols=(1,2))
# X=data[:,0]
# y=data[:,1]
X = data
y = X[:,0]*3 
y = y + X[:,1]*2
print (X.shape, y.shape)
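
The random factors described in the comments could be added along the following lines. This is only a sketch: the data is a hypothetical stand-in for the two stock price columns, and the noise sizes (2% on each term, plus an additive term with standard deviation 5) and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the two price columns loaded from the csv
X = rng.uniform(100.0, 1000.0, size=(50, 2))

# a random factor close to 1 for each term, and an additional random offset
term_noise = rng.normal(1.0, 0.02, size=X.shape)
extra_noise = rng.normal(0.0, 5.0, size=len(X))

# exact plane, and the same plane with the random factors applied
y_clean = 3 * X[:, 0] + 2 * X[:, 1]
y_noisy = (3 * X[:, 0] * term_noise[:, 0]
           + 2 * X[:, 1] * term_noise[:, 1]
           + extra_noise)

print(X.shape, y_noisy.shape)
print("correlation with the exact plane :", np.corrcoef(y_clean, y_noisy)[0, 1])
```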





