# Using satellite imagery to train a model for identifying the type of landmarks

## Sampling

My computer does not have enough memory
to load the entire dataset and operate on it.
Instead,
I sample the training set and the test set down to 20000 examples each
before working with them.
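To give a rough sense of the memory pressure, here is a back-of-the-envelope estimate. It assumes every value is held as an 8-byte number (the pandas default for integer and float columns) and uses the row and column counts that appear in the outputs of this notebook (399999 training rows, 3136 feature columns). The helper name `dense_size_gib` is made up for this sketch.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense numeric table in GiB."""
    return n_rows * n_cols * bytes_per_value / 1024**3

# full training features vs. a 20000-row sample (assumed 8 bytes per value)
full_train = dense_size_gib(399_999, 3136)
sampled = dense_size_gib(20_000, 3136)

print("full training set: {:.2f} GiB".format(full_train))
print("20000-row sample:  {:.2f} GiB".format(sampled))
```

Under these assumptions the full training features need a bit over 9 GiB before any copies are made, while the sample fits in roughly half a GiB.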

In [None]:
# the sampling script that I wrote
from sample_dataset import mainarg as sample_from_filename

# filenames for training and test data
data_filenames = ['dataset/X_train_sat4.csv', 'dataset/X_test_sat4.csv']

# loop through filenames
for filename in data_filenames:
    sample_from_filename(filename)
# next filename


===reading from===
dataset\X_train_sat4.csv
dataset\y_train_sat4.csv
===writing to===
dataset\X_train_sat4_samp20000.csv
dataset\y_train_sat4_samp20000.csv
===sampling from 399999 examples===
0. copying from <_io.TextIOWrapper name='dataset\\y_train_sat4.csv' mode='r' encoding='cp1252'>
1. copying from <_io.TextIOWrapper name='dataset\\X_train_sat4.csv' mode='r' encoding='cp1252'>
writing to dataset\y_train_sat4_samp20000.csv
writing to dataset\X_train_sat4_samp20000.csv

===reading from===
dataset\X_test_sat4.csv
dataset\y_test_sat4.csv
===writing to===
dataset\X_test_sat4_samp20000.csv
dataset\y_test_sat4_samp20000.csv
===sampling from 99999 examples===
0. copying from <_io.TextIOWrapper name='dataset\\y_test_sat4.csv' mode='r' encoding='cp1252'>
1. copying from <_io.TextIOWrapper name='dataset\\X_test_sat4.csv' mode='r' encoding='cp1252'>
writing to dataset\y_test_sat4_samp20000.csv
writing to dataset\X_test_sat4_samp20000.csv
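The `sample_dataset` script itself is not shown here. As a minimal sketch of how such a sampler could work without loading a whole file into memory, the code below picks row indices once and then streams each file line by line, keeping the same rows from the feature file and the label file so they stay aligned. All function names (`choose_row_indices`, `copy_rows`, `count_rows`, `sample_pair`) are hypothetical and are not taken from the original script.

```python
import random

def choose_row_indices(n_rows, n_sample, seed=0):
    """Pick n_sample distinct row indices out of n_rows, in sorted order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_sample))

def copy_rows(src_path, dst_path, keep):
    """Stream src_path line by line, writing only rows whose index is in keep."""
    keep = set(keep)
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for i, line in enumerate(src):
            if i in keep:
                dst.write(line)

def count_rows(path):
    """Count lines without holding the file in memory."""
    with open(path, "r") as f:
        return sum(1 for _ in f)

def sample_pair(x_path, y_path, x_out, y_out, n_sample, seed=0):
    """Sample the same rows from a feature file and its label file."""
    n_rows = count_rows(y_path)  # the label file has far fewer columns, so this pass is cheap
    keep = choose_row_indices(n_rows, n_sample, seed)
    copy_rows(x_path, x_out, keep)
    copy_rows(y_path, y_out, keep)
    return keep

# hypothetical usage, with paths following the naming seen in the log above:
# sample_pair('dataset/X_train_sat4.csv', 'dataset/y_train_sat4.csv',
#             'dataset/X_train_sat4_samp20000.csv',
#             'dataset/y_train_sat4_samp20000.csv', 20000)
```

Choosing the indices first and reusing them for both files is what keeps each sampled feature row paired with its label row.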


## Preprocess data

Now we can work with the sampled data.

Start by importing the necessary modules
and setting up the constants.

In [None]:
import pandas as pd                                         # for the dataframes
from sklearn.linear_model import LinearRegression           # for the learning models

In [None]:
# constants
X_TRAIN_FILENAME = r'dataset/X_train_sat4_samp20000.csv'    # filename of the dataset input
Y_TRAIN_FILENAME = r'dataset/y_train_sat4_samp20000.csv'    # filename of the dataset labels

Read in the files
and do a high-level inspection.

In [None]:
# read in the training data
X_train = pd.read_csv(X_TRAIN_FILENAME)
y_train = pd.read_csv(Y_TRAIN_FILENAME)
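One thing to watch for when reading these files: by default `pd.read_csv` treats the first line as a header. The column names reported later in this section (pixel values such as `122`, and label columns `0`, `1`, `0.1`, `0.2`) and the 19999-row shapes suggest that the sampled CSV files have no header line, so the first example ended up as column names. The toy data below is made up for illustration and only shows the pandas behavior; whether `header=None` is right for the real files is an assumption to check against the files themselves.

```python
import io
import pandas as pd

# two one-hot label rows, no header line (made-up data)
csv_text = "0,1,0,0\n1,0,0,0\n"

with_default = pd.read_csv(io.StringIO(csv_text))                 # first row becomes the header
with_no_header = pd.read_csv(io.StringIO(csv_text), header=None)  # every row is data

print(with_default.shape)          # (1, 4)
print(list(with_default.columns))  # ['0', '1', '0.1', '0.2']
print(with_no_header.shape)        # (2, 4)
```

The `0.1` and `0.2` names come from pandas renaming duplicate header entries, which matches the label column names seen in the summary table.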

In [None]:
# combine training data features, labels
df_train = pd.concat([X_train, y_train], axis=0)

In [None]:
# print shapes of X, y
print("X_train shape\t{}".format(X_train.shape))
print("y_train shape\t{}".format(y_train.shape))
print("combined shape\t{}".format(df_train.shape))

X_train shape	(19999, 3136)
y_train shape	(19999, 4)
combined shape	(39998, 3140)
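The combined shape has twice as many rows as either input, which is what `pd.concat` produces with `axis=0`: the label rows are stacked underneath the feature rows and the non-overlapping columns are filled with `NaN`. If the intent is one row per example with features and labels side by side, `axis=1` is the usual choice. The small frames below are made up purely to show the difference.

```python
import pandas as pd

# made-up features and one-hot labels for three examples
X = pd.DataFrame({"px0": [10, 20, 30], "px1": [11, 21, 31]})
y = pd.DataFrame({"class_a": [1, 0, 0], "class_b": [0, 1, 0]})

stacked = pd.concat([X, y], axis=0)       # rows of y appended below rows of X
side_by_side = pd.concat([X, y], axis=1)  # columns of y appended next to columns of X

print(stacked.shape)                    # (6, 4)
print(side_by_side.shape)               # (3, 4)
print(int(stacked.isna().sum().sum()))  # 12 missing cells created by stacking
```

With `axis=1` the row count stays equal to the number of examples and no missing values are introduced, provided both frames share the same index.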


In [None]:
# print some basic information about the dataset
print('\n===data frame information===')
df_train.info()

# print its parameters
print('\n===data frame parameters===')
df_train.describe()


===data frame information===
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39998 entries, 0 to 19998
Columns: 3140 entries, 122 to 0.2
dtypes: float64(3140)
memory usage: 958.5 MB

===data frame parameters===


,122,136,126,197,106,115,95,180,69,68,...,77.14,165.3,96.30,103.31,94.37,179.9,0,1,0.1,0.2
count,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,...,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0,19999.0
mean,127.778989,123.954298,110.979049,158.80899,127.60998,123.837242,110.864343,158.689084,127.610531,123.910246,...,111.153508,158.80419,128.174759,124.218011,111.20106,158.912496,0.263813,0.20156,0.178609,0.356018
std,42.826465,37.945785,35.706565,37.819509,42.947181,38.090841,35.787016,37.862464,42.863217,37.942215,...,35.980356,37.692666,42.905734,37.908342,35.641014,37.598032,0.44071,0.401175,0.383034,0.478833
min,0.0,3.0,1.0,4.0,0.0,2.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0
25%,98.0,99.0,89.0,140.0,98.0,99.0,89.0,140.0,98.0,99.0,...,89.0,140.0,99.0,100.0,89.0,140.0,0.0,0.0,0.0,0.0
50%,124.0,122.0,110.0,166.0,123.0,122.0,109.0,166.0,123.0,122.0,...,110.0,166.0,124.0,122.0,110.0,166.0,0.0,0.0,0.0,0.0
75%,159.0,148.0,132.0,185.0,159.0,148.0,132.0,185.0,158.0,148.0,...,133.0,185.0,159.0,148.0,133.0,185.0,1.0,0.0,0.0,1.0
max,255.0,255.0,255.0,253.0,244.0,255.0,255.0,252.0,246.0,255.0,...,255.0,254.0,248.0,251.0,255.0,245.0,1.0,1.0,1.0,1.0


In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

print('bias:\t{}'.format(lr.intercept_))
print('weights:\t{}'.format(lr.coef_))

bias:	[0.25037683 0.23082563 0.19187285 0.32692469]
weights:	[[ 1.88310289e-04  8.76441051e-04 -4.95686281e-04 ...  1.14396931e-03
   2.17969825e-05  4.85382527e-04]
 [ 2.01765019e-04 -1.14309711e-03  3.93450135e-04 ... -4.36356295e-04
  -3.13089206e-04  6.56217496e-04]
 [-1.83960392e-04  6.65777439e-04 -8.57771811e-04 ...  2.06162149e-04
   9.23855644e-05 -3.82213849e-04]
 [-2.06114915e-04 -3.99121378e-04  9.60007957e-04 ... -9.13775162e-04
   1.98906659e-04 -7.59386174e-04]]
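The fitted model has four bias terms and four rows of weights because `LinearRegression` fits one linear function per label column. To turn those four real-valued outputs into a single predicted class, one common approach (an assumption here, not something this notebook has settled on) is to take the column with the highest score. The sketch below uses small random data in place of the satellite images, and the sizes are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-in data: 200 examples, 8 features, 4 one-hot classes
rng = np.random.default_rng(0)
n_examples, n_features, n_classes = 200, 8, 4
X = rng.normal(size=(n_examples, n_features))
labels = rng.integers(0, n_classes, size=n_examples)
Y = np.eye(n_classes)[labels]  # one-hot encode the class indices

model = LinearRegression().fit(X, Y)

scores = model.predict(X)          # one real-valued score per class
predicted = scores.argmax(axis=1)  # index of the highest-scoring class

print(model.intercept_.shape)  # (4,)
print(model.coef_.shape)       # (4, 8)
print(predicted[:10])
```

The shapes mirror the output above: one bias per label column and one weight row per label column, with as many weights in each row as there are input features.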
