# Using satellite imagery to train a model for identifying the type of landmarks

## Sampling

Because of the memory limitations of my computer,
I cannot load the entire dataset into memory and perform operations on it.
As a result,
I will instead sample both datasets into ${20000}$ examples
before working with them.

In [6]:
# the sampling script that I wrote
from sample_dataset import mainarg as sample_from_filename

# filenames for training and test data
data_filenames = ['dataset/X_train_sat4.csv', 'dataset/X_test_sat4.csv']

# loop through filenames
for filename in data_filenames:
    sample_from_filename(filename)
# next filename


===reading from===
dataset\X_train_sat4.csv
dataset\y_train_sat4.csv
===writing to===
dataset\X_train_sat4_samp20000.csv
dataset\y_train_sat4_samp20000.csv
===sampling from 399999 examples===
0. copying from <_io.TextIOWrapper name='dataset\\y_train_sat4.csv' mode='r' encoding='cp1252'>
1. copying from <_io.TextIOWrapper name='dataset\\X_train_sat4.csv' mode='r' encoding='cp1252'>
writing to dataset\y_train_sat4_samp20000.csv
writing to dataset\X_train_sat4_samp20000.csv

===reading from===
dataset\X_test_sat4.csv
dataset\y_test_sat4.csv
===writing to===
dataset\X_test_sat4_samp20000.csv
dataset\y_test_sat4_samp20000.csv
===sampling from 99999 examples===
0. copying from <_io.TextIOWrapper name='dataset\\y_test_sat4.csv' mode='r' encoding='cp1252'>
1. copying from <_io.TextIOWrapper name='dataset\\X_test_sat4.csv' mode='r' encoding='cp1252'>
writing to dataset\y_test_sat4_samp20000.csv
writing to dataset\X_test_sat4_samp20000.csv


# Preprocess data

Now we may work with the data.

In [7]:
import pandas as pd                                     # for the dataframes
from sklearn.linear_model import LinearRegression       # for the learning models

In [8]:
# read in the training data
X_train = pd.read_csv('dataset/X_test_sat4_samp20000.csv')
y_train = pd.read_csv('dataset/y_train_sat4_samp20000.csv')

In [9]:
print(X_train.info())
print(y_train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Columns: 3136 entries, 216 to 217.80
dtypes: int64(3136)
memory usage: 478.5 MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       19999 non-null  int64
 1   1       19999 non-null  int64
 2   0.1     19999 non-null  int64
 3   0.2     19999 non-null  int64
dtypes: int64(4)
memory usage: 625.1 KB
None


In [10]:
lr = LinearRegression()
lr.fit(X_train, y_train)

print(r'bias:\t{}'.format(lr.intercept_))
print(r'weights:\t{}'.format(lr.coef_))

bias:\t[0.25037683 0.23082563 0.19187285 0.32692469]
weights:\t[[ 1.88310289e-04  8.76441051e-04 -4.95686281e-04 ...  1.14396931e-03
   2.17969825e-05  4.85382527e-04]
 [ 2.01765019e-04 -1.14309711e-03  3.93450135e-04 ... -4.36356295e-04
  -3.13089206e-04  6.56217496e-04]
 [-1.83960392e-04  6.65777439e-04 -8.57771811e-04 ...  2.06162149e-04
   9.23855644e-05 -3.82213849e-04]
 [-2.06114915e-04 -3.99121378e-04  9.60007957e-04 ... -9.13775162e-04
   1.98906659e-04 -7.59386174e-04]]
