https://www.itl.nist.gov/div898/handbook/

# Welcome

This notebook has functions to help handle common tasks.

**Functions**
- FillNa, DropNa, Replace
- customLambda = lambda x: (x + x % 2)
- df.groupby().transform(customLambda): sumAgg
- pd.melt() -> columns to rows
- DummyEncode
- df.stack() / df.unstack()
- Infer data types: isfinite (inf/nan), isnan, first char is symbol, default value.

#### Common Python Data Manipulations

https://datascience.stackexchange.com/questions/37878/difference-between-isna-and-isnull-in-pandas

 - .isna(), .fillna(), .isnull()
 - .dropna(how='any')
 - .fillna(method='ffill', inplace=True), .fillna(value=0, inplace=True)
 - .duplicated(), .unique(), .drop\_duplicates()
 - .replace()
 - groupby()

- contains(), within() for geospatial data.

**Common Python Cleaning operations:**
1. Check the data types of all columns in the data frame
2. Create a new data frame excluding all the 'object' type columns
3. Select elements from each column that lie within 3 units of Z score

- .cut() will bin your data
- .dtypes, .select\_dtypes(exclude=['object'])
- stats.zscore(df)
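
The three cleaning steps above can be sketched as follows. The data is made up, and the z-score is computed directly with pandas (population standard deviation, `ddof=0`, which is also the default of `scipy.stats.zscore`) so that the example only depends on pandas.

```python
import pandas as pd

# Hypothetical data: one numeric column with an obvious outlier and one text column.
df = pd.DataFrame({
    "height": [10.0] * 19 + [500.0],
    "label": pd.Series(["x"] * 20, dtype="object"),
})

# 1. Check the data types of all columns.
print(df.dtypes)

# 2. Keep only the non-object columns.
numeric_df = df.select_dtypes(exclude=["object"])

# 3. Keep rows whose values all lie within 3 standard deviations of the column mean.
z_scores = (numeric_df - numeric_df.mean()) / numeric_df.std(ddof=0)
within_3 = (z_scores.abs() < 3).all(axis=1)
filtered_df = df[within_3]

# pd.cut() bins a numeric column into labelled intervals.
height_bins = pd.cut(df["height"], bins=[0, 100, 1000], labels=["low", "high"])
```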

#### FILTERING

* DataFrame.isna()	Detect missing values.
* DataFrame.any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
* DataFrame.all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
* DataFrame.filter([items, like, regex, axis])	Subset rows or columns of dataframe according to labels in the specified index.
* DataFrame.dropna([axis, how, thresh, …])	Remove missing values.
* DataFrame.fillna([value, method, axis, …])	Fill NA/NaN values using the specified method.
* DataFrame.replace([to_replace, value, …])	Replace values given in to_replace with value.
* DataFrame.interpolate([method, axis, limit, …])	Interpolate values according to different methods.
* DataFrame.nlargest(n, columns[, keep])	Return the first n rows ordered by columns in descending order.
* DataFrame.nsmallest(n, columns[, keep])	Return the first n rows ordered by columns in ascending order.
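
As a small illustration of some of the filtering methods listed above, here is a sketch on made-up data. The column names (`births_2019`, `births_2020`, `name`) are assumptions, not taken from any dataset in this notebook.

```python
import pandas as pd

# Hypothetical data with a few gaps.
df = pd.DataFrame({
    "births_2019": [10, None, 30],
    "births_2020": [12, 18, None],
    "name": ["a", "b", "c"],
})

# filter(like=...) selects columns whose label contains the given text.
births = df.filter(like="births")

# isna() + any(axis=1) flags rows that have at least one missing value.
rows_with_gap = df[births.isna().any(axis=1)]

# interpolate() fills gaps column by column (linear by default).
filled = births.interpolate()

# nlargest() returns the top n rows ordered by the given column.
top2 = df.nlargest(2, "births_2020")
```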

#### GROUPING/ Aggregating/ Manipulating

* DataFrame.pivot([index, columns, values])	Return reshaped DataFrame organized by given index / column values.
* df.agg("mean", axis="columns") # axis : {0 or ‘index’, 1 or ‘columns’}, default 0
* DataFrame.compound(axis=None, skipna=None, level=None)
* DataFrame.count(axis=0, level=None, numeric_only=False)
* df.groupby(['1_tpop']).mean()
* DataFrame.insert(loc, column, value[, …])	Insert column into DataFrame at specified location.
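
A short sketch of pivot, groupby, agg and insert from the list above. The frame and its column names (`city`, `year`, `pop`, `label`) are invented for the example.

```python
import pandas as pd

# Hypothetical long-format data.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2019, 2020, 2019, 2020],
    "pop": [10, 12, 20, 26],
})

# pivot(): one row per city, one column per year.
wide = df.pivot(index="city", columns="year", values="pop")

# groupby().mean(): average population per city.
means = df.groupby("city")["pop"].mean()

# agg("mean", axis="columns"): average across the year columns of each row.
row_means = wide.agg("mean", axis="columns")

# insert(): add a new column at position 0.
wide.insert(0, "label", ["first", "second"])
```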

The biggest data cleaning task is usually handling missing values.

Pandas will recognize both empty cells and “NA” types as missing values. Anything else has to be specified on import.

In [None]:
import numpy as np
import pandas as pd

# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property data.csv", na_values = missing_values)

# Show which ST_NUM entries are missing
print(df['ST_NUM'].isnull())

In [None]:
#Dealing with missing values? How many np.nan per column?
df.isna().sum()

In the code we’re looping through each entry in the “Owner Occupied” column. To try and change the entry to an integer, we’re using int(row).
If the value can be changed to an integer, we change the entry to a missing value using Numpy’s np.nan.
On the other hand, if it can’t be changed to an integer, we pass and keep going.  The .loc method is the preferred Pandas method for modifying entries in place. https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.loc.html
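
A vectorized alternative to the loop described above can be sketched with `pd.to_numeric`. The column name `OWN_OCCUPIED` comes from this notebook; the sample values are made up. One difference to keep in mind: `pd.to_numeric` also treats decimals such as "12.5" as numbers, whereas `int("12.5")` raises a `ValueError`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the "Owner Occupied" column: Y/N flags plus one stray number.
df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "N", "12", "Y", np.nan]})

# Entries that can be parsed as numbers are not valid Y/N flags.
numeric_mask = pd.to_numeric(df["OWN_OCCUPIED"], errors="coerce").notna()

# .loc with a boolean mask modifies the matching entries in place.
df.loc[numeric_mask, "OWN_OCCUPIED"] = np.nan
```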

In [None]:
# Total missing values for each feature
print(df.isnull().sum())

In [None]:
# Any missing values?
print(df.isnull().values.any())

In [None]:
# Total number of missing values
print(df.isnull().sum().sum())

In [None]:
# Replace missing values with a number
df['ST_NUM'] = df['ST_NUM'].fillna(125)

In [None]:
# Location based replacement
df.loc[2,'ST_NUM'] = 125

In [None]:
# Replace using median 
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'] = df['NUM_BEDROOMS'].fillna(median)

In [None]:
# This groups the data by the given columns, then applies the aggregation function (size = row count per group)
df.groupby(by =['class', 'doctor_name']).size()

In [None]:
# Drop columns in which every value is NaN; use df.dropna(axis=0, how='any') to drop rows that contain any NaN
df = df.dropna(axis = 1, how = 'all') 

In [None]:
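# Detecting numbers: entries that parse as integers are not valid Y/N flags
# Iterating over index labels avoids relying on a default 0..n-1 index
for idx, value in df['OWN_OCCUPIED'].items():
    try:
        int(value)
        df.loc[idx, 'OWN_OCCUPIED'] = np.nan
    except (ValueError, TypeError):
        pass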

In [None]:
# Number of unique values in each column.
df.nunique()

# We see here that although there are 699 rows, there are only 645 unique patient_id’s. 
# This could mean that some patient appear more than once in the dataset
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html
df[df.duplicated(subset = 'patient_id', keep =False)].sort_values('patient_id')

# The number of times a patient shows up in the dataset can also be viewed.
repeat_patients = df.groupby(by = 'patient_id').size().sort_values(ascending =False)

# To remove patients that show up more than 2 times in the data set.
filtered_patients = repeat_patients[repeat_patients > 2].to_frame().reset_index()
filtered_df = df[~df.patient_id.isin(filtered_patients.patient_id)]

In [None]:
# It's good to inspect unique key identifiers
df.nunique()

# This shows rows that show up more than once and have the exact same column values. 
df[df.duplicated(keep = 'last')]

# # This shows all instances where patient_id shows up more than once, but may have varying column values
# df[df.duplicated(subset = 'patient_id', keep =False)].sort_values('patient_id')

# Now that I have seen that there are some duplicates, I am going to go ahead and remove any duplicate rows,
# i.e. identical rows that occur more than once

df = df.drop_duplicates(subset = None, keep ='first')

repeat_patients = df.groupby(by = 'patient_id').size().sort_values(ascending =False)
repeat_patients

filtered_patients = repeat_patients[repeat_patients > 2].to_frame().reset_index()
filtered_df = df[~df.patient_id.isin(filtered_patients.patient_id)]
filtered_df

# This is all the repeating patients details

df[df.patient_id.isin(filtered_patients.patient_id)]

In [None]:
# One Hot Encoding Categorical Data
# .copy() avoids a SettingWithCopyWarning when the new column is added below
categorical_df = df[['patient_id', 'doctor_name']].copy()

# This specifies all rows (':') and column name 'doctor_count'
categorical_df.loc[:,'doctor_count'] = 1

doctors_one_hot_encoded  = pd.pivot_table(categorical_df
                                  ,index = categorical_df.index, 
                                  columns = ['doctor_name'], values = ['doctor_count'])

doctors_one_hot_encoded = doctors_one_hot_encoded.fillna(0)

doctors_one_hot_encoded.columns = doctors_one_hot_encoded.columns.droplevel()

# Typically a left join in pandas looks like this:
# leftJoin_df = pd.merge(df1, df2, on ='col_name', how='left')
# However we are joining on the index so we pass the “left_index” and “right_index” option 
# to specify that the join key is the index of both tables
combined_df = pd.merge(df, doctors_one_hot_encoded, left_index = True,right_index =True, how ='left')
combined_df
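
For comparison, the same kind of one-hot table can be sketched with `pd.get_dummies`, which this notebook also lists under basic ops. The doctor names and patient ids below are invented, and `dtype=int` is used so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical stand-in for the patient table.
df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "doctor_name": ["Dr. Doe", "Dr. Smith", "Dr. Doe"],
})

# One indicator column per distinct doctor_name.
one_hot = pd.get_dummies(df["doctor_name"], dtype=int)

# Both frames share the same index, so join() lines the rows up.
combined = df.join(one_hot)
```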

In [None]:
# How to convert benign & malignant to 0 and 1
class_to_numerical_dictionary = {'benign':0, 'malignant':1}
combined_df['class'] = combined_df['class'].map(class_to_numerical_dictionary)
combined_df

In [None]:
# Feature building: 
# combined_df[~(combined_df.cell_size_uniformity >5) & (combined_df.cell_shape_uniformity >5)]
def celltypelabel(x):
    # x is a single row, so plain `and` is the idiomatic way to combine the two scalar checks
    if (x['cell_size_uniformity'] > 5) and (x['cell_shape_uniformity'] > 5): return 'normal'
    else: return 'abnormal'
      
combined_df['cell_type_label'] = combined_df.apply(celltypelabel, axis=1)

# Read In Data

In [None]:
import pandas as pd
import csv
import IPython
from google.colab import output
import json
import geopandas

# Functions created here are called by the server
df = 'test'

# All data is returned in the same exact fashion. 
# pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html
# ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
def importDataCSVPy(fileName):
  global df
  df = pd.read_csv(fileName,quoting=csv.QUOTE_ALL)
  return df.to_json(orient='split') 
 
def importDataXLPy(fileName):
  global df
  df = pd.read_excel(fileName)  # read_excel is a pandas function; geopandas does not provide one
  return df.to_json(orient='split')

def importDataGEOJSONPy(fileName):
  global df
  df = geopandas.read_file(fileName)
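  # GeoDataFrame.to_json() serializes to GeoJSON and passes extra keyword arguments on to
  # json.dumps, which has no 'orient' option, so the call is made without it
  return df.to_json()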

def importDataJSONPy(fileName):
  global df
  df = pd.read_json(fileName)
  return df.to_json(orient='split') 
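
The per-format import functions above could be folded into one dispatcher keyed on the file extension. This is only a sketch: `import_data` and `READERS` are hypothetical names, the extension-to-reader mapping is an assumption, and the CSV quoting option used in `importDataCSVPy` is left out for brevity.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xlsx": pd.read_excel,
}

def import_data(file_name):
    """Read a file with the reader matching its extension and return 'split' JSON."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in READERS:
        raise ValueError("Unsupported file type: " + extension)
    frame = READERS[extension](file_name)
    return frame.to_json(orient="split")

# Usage: write a tiny CSV to a temporary folder and import it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    pd.DataFrame({"a": [1, 3], "b": [2, 4]}).to_csv(sample_path, index=False)
    payload = json.loads(import_data(sample_path))
```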


# dashboards notes reduced

> The functions that transform notebooks in a library

Basic Text

TODO :
*  Provide the user with import options:
*  Ask For file, Default = False
*  Ask For Delimiters, Default = ,
*  Ask For String Delimiters, Default = "
*  Ask If First Column Represents Header, Default = False
*  Ask If the Column Names are Correct

FillNA = -1, avg

FillNA THEN Coerce

### Todo: 
* Interactive inputs allow the user to perform simple queries
* Fixed Dictionary [ distinct, not, like, avg, min, max, mean, median, mode ]
*   Query Replaces the Imported Dataset
*   Repeat until user specifies otherwise
*  Template: Select From Where GroupBy Having
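
One way the Select / From / Where / GroupBy / Having template could map onto pandas is sketched below. The table, its columns (`doctor`, `age`) and the thresholds are made up for the example.

```python
import pandas as pd

# FROM: a hypothetical table.
df = pd.DataFrame({
    "doctor": ["x", "x", "y", "y", "y"],
    "age": [30, 40, 50, 20, 10],
})

result = (
    df[df["age"] > 25]                    # WHERE age > 25
    .groupby("doctor")["age"]             # GROUP BY doctor
    .agg(avg_age="mean", n="count")       # SELECT AVG(age), COUNT(*)
    .query("n >= 2")                      # HAVING COUNT(*) >= 2
)
```

The boolean mask plays the role of WHERE because it runs before grouping, while `query` on the aggregated frame plays the role of HAVING because it runs after.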

## MISC 

### Import 

In [None]:
td.filter(regex="Births").hist()

## Parse The DataTypes

In [None]:
# CASTING NOTES
# DataFrame.infer_objects()	Attempt to infer better dtypes for object columns.
# DataFrame.astype(dtype[, copy, errors])	Cast a pandas object to a specified dtype dtype.
# CASTING and COERCING => df['y'] = pd.to_numeric(df['y'],errors='coerce')
# Coerce: Ignore, Coerce, Raise
# DataFrame.to_period([freq, axis, copy])	Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency (inferred from index if not passed).
# DataFrame.to_timestamp([freq, how, axis, copy])	Cast to DatetimeIndex of timestamps, at beginning of period.
# DataFrame.tz_convert(tz[, axis, level, copy])	Convert tz-aware axis to target time zone.

df3.head()
# To Numeric
df3 = df3[ df3.columns ].apply(pd.to_numeric, errors='ignore')

# Parse Date Times
# DataFrame.to_timestamp([freq, how, axis, copy])	Cast to DatetimeIndex of timestamps, at beginning of period.

# Parse GEOMETRY Coordinates 
# df['y'] = pd.to_numeric(df['y'],errors='coerce')
# df = df.replace(np.nan, 0, regex=True)
# gdf = GeoDataFrame(df.drop(['x', 'y'], axis=1), crs={'init': 'epsg:2248'}, geometry=[shapely.geometry.Point(xy) for xy in zip(df.x, df.y)])
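
A small sketch of the coercion notes above: `errors='coerce'` turns unparseable entries into NaN (or NaT for dates), which can then be filled. The sample values and the fill value of -1 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw text column with one bad entry and one missing entry.
raw = pd.DataFrame({"y": ["1.5", "abc", None, "4"]})

# Coerce: values that cannot be parsed become NaN instead of raising an error.
raw["y_num"] = pd.to_numeric(raw["y"], errors="coerce")

# Fill the gaps left by coercion with a sentinel value.
raw["y_filled"] = raw["y_num"].fillna(-1)

# The same option exists for dates: unparseable entries become NaT.
dates = pd.to_datetime(pd.Series(["2020-01-31", "not a date"]), errors="coerce")
```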

In [None]:
# Perform the Casting.
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count==total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))

In [None]:
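# Get columns by dtype
types = df.dtypes
bools = df.select_dtypes(include='bool')
ints = df.select_dtypes(include=['float','integer'] )
# df.ftypes, df.get_dtype_counts() and df.get_ftype_counts() were removed in pandas 1.0;
# counting the dtypes directly gives the same summary
df.dtypes.value_counts()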

In [None]:
# Data type conversion using the to_datetime() and astype() methods:

# Example of to_datetime():
train['Date.of.Birth'] = pd.to_datetime(train['Date.of.Birth'])

# Example of astype():
train['ltv'] = train['ltv'].astype('int64')

In [None]:
# Length - The len function counts the number of observations in a Series. The function will count all observations, regardless if there are missing or null values.
length = len(beers["ibu"])
# Count - The count function will return the number of non-NA/non-null observations in a Series.
count = beers["ibu"].count()
# Missing Values - With the Length and the Count, we are now able to calculate the number of missing values. The number of missing values is the difference between the Length and the Count.
pct_of_missing_values = "{0:.1f}%".format((length - count) / length * 100)
# Minimum/Maximum Value - The minimum and maximum value of a dataset can easily be obtained with the min and max function on a Series.
# Quantile Statistics - Quantiles are cut points that split a distribution in equal sizes. Many quantiles have their own name. If you split a distribution into four equal groups, the quantile you created is named quartile. You can easily create quantile using the quantile function on a Series. You can pass to that function an array with the different quantiles to compute. In the case below, we want to split our distribution in four equal groups.
quantile = beers["ibu"].quantile([.25, .5, .75])
# We can’t talk about data profiling without mentioning the importance of a frequency-distribution plot. It is one of the simplest yet most powerful visualizations. It shows the frequency of each value in our dataset.
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(beers["ibu"].dropna(), kde=True);
# Correlations - Correlations are a great way to discover relationships between numerical variables. There are various ways to calculate the correlation. The Pearson correlation coefficient is a widely used approach that measures the linear dependence between two variables. The correlation coefficient ranges from -1 to 1. A correlation of 1 is a total positive correlation, a correlation of -1 is a total negative correlation and a correlation of 0 means there is no linear correlation. We can perform that calculation using the corr function on a DataFrame. By default, this function uses the Pearson correlation coefficient. Other methods can be selected through its method argument.
beers[["abv", "ibu", "ounces"]].corr()
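
The length, count and missing-value percentage described in the cell above can be checked on a tiny made-up Series. The values are invented; `isna().mean()` is shown as an equivalent shortcut for the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for beers["ibu"]: five observations, two of them missing.
ibu = pd.Series([35.0, np.nan, 60.0, np.nan, 20.0])

length = len(ibu)        # counts every observation, missing or not
count = ibu.count()      # counts only the non-missing observations

# Share of missing values: subtract first, then divide by the length.
pct_missing = (length - count) / length * 100

# The mean of a boolean mask gives the same fraction directly.
pct_missing_short = ibu.isna().mean() * 100

# Quartiles ignore missing values.
quartiles = ibu.quantile([0.25, 0.5, 0.75])
```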

# NOTES

In [None]:
  # To 'import' a script you wrote, map its filepath into the sys
  
# https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

Exponentially-weighted moving window functions ( mean std var corr cov) 
Standard expanding window functions ( sum, mean, median, var, std, min, max, corr, cov, skew, kurt, agg, quant)
Standard moving window functions (count, sum, mean, median, var, std, min, max, corr, cov, skew, kurt, aggr, quant)
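
A minimal sketch of the three window families named above (moving, expanding, exponentially weighted), using a four-value Series chosen so the results are easy to check by hand. The window size and `alpha` are arbitrary choices.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

# Moving window: mean of the current and previous value (first entry has no full window).
rolling_mean = values.rolling(window=2).mean()

# Expanding window: running total over everything seen so far.
expanding_sum = values.expanding().sum()

# Exponentially weighted mean: recent values weigh more; alpha=0.5 halves the weight per step.
ewm_mean = values.ewm(alpha=0.5).mean()
```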

df( columns, dtypes, ftypes, select_dtypes, values[ReturnsNumpyArrOfArrays], shape[records,Cols] size(recordsXcols), empty[T/F]
df( astype[castDtype], infer_objects, copy, isna, notna )
df( add, sub, mul, div, mod, pow)
df( abs, all, any, corr, count, cov, cummax, cummin, cumprod, cumsum, describe, kurt, mad[mean_abs_dev], max, mean, median, min, mode, pct_change[curVSpriorElem] , prod, quantile, rank, round, sem[stdErrMean], skew, sum, std, var, nunique[num_unique] )

(add_prefix, add_suffix, at_time, between_time, drop, drop_duplicates, equals, head, last)
(dropna, fillna, replace, interpolate)
(pivot, pivot_table, sort_values, nlargest, nsmallest)
DataFrame.to_timestamp(self[, freq, how, …])	Cast to DatetimeIndex of timestamps, at beginning of period.
   
DataFrame.plot([x, y, kind, ax, ….])	DataFrame plotting accessor and method
DataFrame.plot.area(self[, x, y])	Draw a stacked area plot.
DataFrame.plot.bar(self[, x, y])	Vertical bar plot.
DataFrame.plot.barh(self[, x, y])	Make a horizontal bar plot.
DataFrame.plot.box(self[, by])	Make a box plot of the DataFrame columns.
DataFrame.plot.density(self[, bw_method, ind])	Generate Kernel Density Estimate plot using Gaussian kernels.
DataFrame.plot.hexbin(self, x, y[, C, …])	Generate a hexagonal binning plot.
DataFrame.plot.hist(self[, by, bins])	Draw one histogram of the DataFrame’s columns.
DataFrame.plot.kde(self[, bw_method, ind])	Generate Kernel Density Estimate plot using Gaussian kernels.
DataFrame.plot.line(self[, x, y])	Plot Series or DataFrame as lines.
DataFrame.plot.pie(self, \*\*kwargs)	Generate a pie plot.
DataFrame.plot.scatter(self, x, y[, s, c])	Create a scatter plot with varying marker point size and color.
DataFrame.boxplot(self[, column, by, ax, …])	Make a box plot from DataFrame columns.
DataFrame.hist(data[, column, by, grid, …])	Make a histogram of the DataFrame’s.
   
   
df.to_html()
df.to_latex()
DataFrame.to_records(self[, index, …])
   

### Plot Histograms

In [None]:
td.filter(regex="Births").hist()

## Basic ops

In [None]:
# Encoding Mechanism
pd.get_dummies(df)

In [None]:
# If the data lends itself to it
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
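
The `norm` function above relies on a `train_stats` table that is not defined in this notebook. A sketch of one way to build it, assuming it comes from `describe().transpose()` on the training data (the datasets here are made up):

```python
import pandas as pd

# Hypothetical train/test split.
train_dataset = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
test_dataset = pd.DataFrame({"a": [2.0], "b": [30.0]})

# Assumption: train_stats holds one row per feature with 'mean' and 'std' columns.
train_stats = train_dataset.describe().transpose()

def norm(x):
    # Standardize with the training statistics so train and test share one scale.
    return (x - train_stats["mean"]) / train_stats["std"]

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
```

Using the training mean and standard deviation for the test set as well keeps information from the test data out of the preprocessing step.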

In [None]:
# DataFrame.shape	Return a tuple representing the dimensionality of the DataFrame.
df3.shape

#DataFrame.size	Return an int representing the number of elements in this object.
df.size

# DataFrame.ndim	Return an int representing the number of axes / array dimensions.
df.ndim

# Not Used : 
# DataFrame.axes	Return a list representing the axes of the DataFrame.

df3.dtypes

# Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).
df.kurtosis()

In [None]:
# Available Operations
df.describe()
DataFrame.median([axis, skipna, level, …])	# Return the median of the values for the requested axis.
DataFrame.mode([axis, numeric_only, dropna])	# Get the mode(s) of each element along the selected axis.
DataFrame.sem([axis, skipna, level, ddof, …])	# Return unbiased standard error of the mean over requested axis.
DataFrame.nunique([axis, dropna])	# Count distinct observations over requested axis.
DataFrame.var([axis, skipna, level, ddof, …])	# Return unbiased variance over requested axis.
DataFrame.drop_duplicates([subset, keep, …])	# Return DataFrame with duplicate rows removed, optionally only considering certain columns.

##  Categorical Analysis:

Count, Unique, Top, Frequency

In [None]:
# 
# Count each Unique 
# Plot a histogram
# 
def getCategoricalCounts(df):
  # Skip the first row, then print the count of each unique value per column
  df = df.iloc[1:]
  for column in df:
    print(column)
    print(df[column].value_counts())
  # return df.value_counts().plot.bar(title="Catagorical Distribution")
  # print(sales_data[sales_data['File_Type'] == 'Historical']['SKU_number'].count())
# tmp = df3[ df3.columns[0] ]
# getCategoricalCounts( tmp )
# df3.select_dtypes(include=['int','float'] )
# df3.describe(include='all')

In [None]:
# Print the value counts for categorical columns
for col in df.columns:
    if df[col].dtype == 'object':
        print('\nColumn Name:', col,)
        print(df[col].value_counts())

In [None]:
### Distribution by Categorical Column Values

# Grade distribution by address
# .ix was removed in pandas 1.0, so .loc is used; fill=True replaces the deprecated shade=True
sns.kdeplot(df.loc[df['address'] == 'U', 'Grade'], label = 'Urban', fill = True)
sns.kdeplot(df.loc[df['address'] == 'R', 'Grade'], label = 'Rural', fill = True)
plt.xlabel('Grade'); plt.ylabel('Density'); plt.title('Density Plot of Final Grades by Location');

# Variable Correlations with Final Grade
# Correlations of numerical values (numeric_only=True skips the object columns)
df.corr(numeric_only=True)['Grade'].sort_values()
# Categorical Correlations using One-Hot Encoding
# Select only categorical variables
category_df = df.select_dtypes('object')
# One hot encode the variables
dummy_df = pd.get_dummies(category_df)
# Put the grade back in the dataframe
dummy_df['Grade'] = df['Grade']
# Correlations in one-hot encoded dataframe
dummy_df.corr()['Grade'].sort_values()

In [None]:
# A Starter Pack to Exploratory Data Analysis with Python, pandas, seaborn, and scikit-learn

# https://towardsdatascience.com/a-starter-pack-to-exploratory-data-analysis-with-python-pandas-seaborn-and-scikit-learn-a77889485baf
  
def categorical_summarized(dataframe, x=None, y=None, hue=None, palette='Set1', verbose=True):
    '''
    Helper function that gives a quick summary of a given column of categorical data
    Arguments
    =========
    dataframe: pandas dataframe
    x: str. horizontal axis to plot the labels of categorical data, y would be the count
    y: str. vertical axis to plot the labels of categorical data, x would be the count
    hue: str. if you want to compare it another variable (usually the target variable)
    palette: array-like. Colour of the plot
    Returns
    =======
    Quick Stats of the data and also the count plot
    '''
    if x is None:
        column_interested = y
    else:
        column_interested = x
    series = dataframe[column_interested]
    print(series.describe())
    print('mode: ', series.mode())
    if verbose:
        print('='*80)
        print(series.value_counts())

    sns.countplot(x=x, y=y, hue=hue, data=dataframe, palette=palette)
    plt.show()
    
# Target Variable: Survival
# Univariate Analysis
c_palette = ['tab:blue', 'tab:orange']
categorical_summarized(train_df, y = 'Survived', palette=c_palette)


# Feature Variable: Gender
# Bivariate Analysis
categorical_summarized(train_df, y = 'Sex', hue='Survived', palette=c_palette)

##  Numeric Analysis:

In [None]:
# A Starter Pack to Exploratory Data Analysis with Python, pandas, seaborn, and scikit-learn

# https://towardsdatascience.com/a-starter-pack-to-exploratory-data-analysis-with-python-pandas-seaborn-and-scikit-learn-a77889485baf

def quantitative_summarized(dataframe, x=None, y=None, hue=None, palette='Set1', ax=None, verbose=True, swarm=False):
    '''
    Helper function that gives a quick summary of quantitative data
    Arguments
    =========
    dataframe: pandas dataframe
    x: str. horizontal axis to plot the labels of categorical data (usually the target variable)
    y: str. vertical axis to plot the quantitative data
    hue: str. if you want to compare it another categorical variable (usually the target variable if x is another variable)
    palette: array-like. Colour of the plot
    swarm: if swarm is set to True, a swarm plot is overlaid
    Returns
    =======
    Quick Stats of the data and also the box plot of the distribution
    '''
    series = dataframe[y]
    print(series.describe())
    print('mode: ', series.mode())
    if verbose:
        print('='*80)
        print(series.value_counts())

    sns.boxplot(x=x, y=y, hue=hue, data=dataframe, palette=palette, ax=ax)

    if swarm:
        sns.swarmplot(x=x, y=y, hue=hue, data=dataframe,
                      palette=palette, ax=ax)

    plt.show()
    
# quantitative_summarized can take in one Quantitative Variable and up to two Categorical Variables, where the Quantitative Variable has to be assigned to y and the other two Categorical Variables can be assigned to x and hue respectively. 

# univariate analysis
quantitative_summarized(dataframe= train_df, y = 'Age', palette=c_palette, verbose=False, swarm=True)

# bivariate analysis with target variable
quantitative_summarized(dataframe= train_df, y = 'Age', x = 'Survived', palette=c_palette, verbose=False, swarm=True)


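# multivariate analysis with Embarked variable and Pclass variable
# c_palette3 is not defined above; a three-colour palette (one colour per Pclass) is assumed here
c_palette3 = ['tab:blue', 'tab:orange', 'tab:green']
quantitative_summarized(dataframe= train_df, y = 'Age', x = 'Embarked', hue = 'Pclass', palette=c_palette3, verbose=False, swarm=False)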



# Geo

**Future Self Service Tool**

Data analytics
1. Self Service
2. Recurrent Reports 
3. Embedded Analytics.

**GisHandler**() 

- Check Columns
- Check If Operations will work as expected
- perform operations
- tidy up
- save
- return

**Main**( Check For Missing Values, Perform Operation)

**readFile**() - csv/postgis -df -reverseGeocode? ColumnToCords? -Geodf

**Geodataframe** -toCrs, - saveGeoDataFrame

**MergeBounds**() 

**FilterBounds**() 

**FilterPoints**() Bounds Points 

**PointsInPoly**()


**Applied Spatial Statistics**
- Prior Posterior Distribution
- Hierarchical Models
- Markov Chain Monte Carlo
- Kernel Methods
- Dynamic State Space Modeling
- Multiple linear Regressions
- Spatial Models (CAR, SAR), Kriging
- Time series models: ARM ARMA 
- Dynamic linear models
- multi level models - causal inference - meta analysis
- multi agent decision making
- variable transformations
- eigenvalues

**Applied Spatial Statistics** -> Prior/Posteriors, MCMC, Kernel methods, dynamic state space modeling, multiple linear regression, multilevel models (causal inference, meta analysis), multi agent decision making, variable transformations, eigenvalues
**Spatial models** (CAR, SAR), Kriging
**Time Series Models**: ARM ARMA Dynamic linear models

Exploratory spatial analysis, spatial autocorrelation, spatial regression, interpolation, grid based stats, point based stats, spatial network analysis, spatial clustering.

 Big-Data, Structure(Semi/Un), Time-Stamped, Spatial, Spatio-Temporal, Ordered, Stream, Dimensionality,
 Primary Keys, Unique Values, Index, Spatial, Auto Increment, Default Values, Null Values


**Geographic Inquiry**:
- Describe real world phenomena
- Study of Spatial Arrangement of features
- Patterns arise as a result of process operating within space
- Measure compare generate
- Size distribution pattern contiguity shape community scale orientation relation
- How to compare? How to describe and analyze? How to predict?
- Entry, conversion, storage, query, manipulation, analysis, presentation,
- Req, process, clean, explore, model …
- Hot spot analysis -> cluster points
- Line of sight/visibility analysis -> network, overlay, proximity, risk
- Heat maps
- GeoCoding
- Distance Decay
- Clip Analysis
- post analysis
- land use analysis
- voronoi crop by bounds of other ds,
- Buffering -radius around a point

- Map coverage
- spatial resource allocation
- impact assessment
- pollutant reduction
- decision support
- facility management (water plant mgmt)
- operations mgmt
- site selection - where to do xyz
- business/marketing

PySAL, the Python Spatial Analysis Library:
- http://pysal.org/notebooks/explore/esda/Spatial_Autocorrelation_for_Areal_Unit_Data.html
- https://pysal.org/notebooks/intro

Shape Analysis
- hull: calculate the convex hull of the point pattern
- mbr: calculate the minimum bounding box (rectangle)

The python file centrography.py contains several functions with which we can conduct centrography analysis.

Random point patterns are the outcome of CSR (https://en.wikipedia.org/wiki/Complete_spatial_randomness). CSR has two major characteristics:
- Uniform: each location has equal probability of getting a point (where an event happens)
- Independent: the locations of event points are independent of each other

It usually serves as the null hypothesis in testing whether a point pattern is the outcome of a random process.

There are two possible objectives in a discriminant analysis:
- finding a predictive equation for classifying new individuals
- interpreting the predictive equation to better understand the relationships that may exist among the variables.

Clark and Evans (1954) demonstrated that, under the null hypothesis that the underlying spatial process is CSR, the mean nearest neighbor distance statistic is normally distributed. We can use this test statistic to determine whether a point pattern is the outcome of CSR.
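
The Clark and Evans test mentioned above can be sketched with NumPy alone. Under CSR with density λ = n / A, the expected mean nearest neighbor distance is 1 / (2√λ) and its standard error is 0.26136 / √(nλ). The function name `clark_evans`, the grid of points and the study area are assumptions for illustration, and no edge correction is applied.

```python
import numpy as np

def clark_evans(points, area):
    """Return the nearest neighbor ratio R and its z-score under CSR (no edge correction)."""
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Pairwise distances; the diagonal is masked so a point is not its own neighbor.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)

    observed = dists.min(axis=1).mean()
    density = n / area
    expected = 0.5 / np.sqrt(density)
    std_error = 0.26136 / np.sqrt(n * density)
    return observed / expected, (observed - expected) / std_error

# A perfectly regular 5 x 5 grid with unit spacing inside an assumed area of 25.
grid = [(x, y) for x in range(5) for y in range(5)]
ratio, z_score = clark_evans(grid, area=25.0)
```

A ratio near 1 is consistent with CSR, a ratio below 1 suggests clustering, and a ratio above 1 (as for this grid) suggests a dispersed, regular pattern.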

# Misc

https://www.gnu.org/philosophy/open-source-misses-the-point.html

It seems to me that the chief difference between the MIT license and the GPL is that the MIT license doesn't require modifications to be open sourced, whereas the GPL does once you distribute them.

You don't have to open-source your changes if you're using GPL code. You can modify it and use it for your own purposes as long as you're not distributing it.

BUT...

if you DO distribute it, then the entire project that uses the GPL code must also be released under the GPL. That means the source must be made available, and the recipient gets all the same rights as you: they can turn around and distribute it, modify it, sell it, etc.

And that would include your proprietary code, which would then no longer be proprietary; it becomes open source.

With MIT, even if you distribute your proprietary code that uses the MIT-licensed code, you do not have to make the code open source. You can distribute it as a closed app where the code is encrypted or is a binary.

The included MIT-licensed code can be encrypted too, as long as it carries the MIT license notice.
	

- File -> UncleanData -> ToCsvFormat(filename, data)
- ProcessCsv -> Unclean Data
- IndexDB
- URL -> browser or server? callServer(url)
- json/geojson/xl/csv -> tocsvformat -> iscsv -> stringreplace, isjson -> papaunparse, isxl -> readxlsx[0] -> tocsv, isgeoj -> json -> papaunparse
- JSON.parse at runtime is faster than inlining the data when the data is > 10KB
- Code caching occurs when inline JS > 1KB
- V8 reduced parse/compilation by 40% using worker threads
- V8 raw JS parse speed is 2x since Chrome 60

Clear indexdb -> readFile. Insert into IndexDB V.1.0


- jpl- sweet ontology
- Geoincubator group
- Rdf, qsparql, gml, kml