# Data Preprocessing Pipeline

This data preprocessing pipeline generates a DataFrame for machine learning modeling of FATP return boards.

## Preprocessing steps
Step 1. Download necessary data (using Hive) <br />
    - FCT from test log csv (1 week)
    - GateKeeper from Bobcat (2 weeks)
    - RPC (4 weeks)
Step 2. Transform downloaded data to DataFrames <br />
Step 3. Join and filter data (X) <br />
Step 4. Filter RPC data (y) <br />
Step 5. Extract FCT test values (FCT['items']) and store them as a separate DataFrame <br />
Step 6. Missing value handling before data scaling <br />
Step 7. Data scaling (normalization, min-max scaling) <br />
Step 8. Missing value handling after data scaling <br />

### Step 1. Download necessary data

#### a. SSH into the server **(10.195.223.53)** and use Hive to download the test log data for a user-specified date and period, e.g. station = **FCT**, date = 2015-10-26, period = 6 days (a week)

In [1]:
!ssh mlb@10.195.223.53 "hive -e \"use cpk; select * from mlb_test_log_detail \
                        where station = 'FCT'\
                        and model = 'N66'\
                        and hour between '2015-10-29_00' and '2015-10-29_23';\"" \
                        > Data/FCT_20151029.log

15/12/11 08:30:01 WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does not exist
15/12/11 08:30:01 WARN conf.HiveConf: HiveConf of name hive.heapsize does not exist
15/12/11 08:30:01 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
15/12/11 08:30:01 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.noconditionaltask does not exist

Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.6.0-2800/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.6.0-2800/hive/lib/hive-jdbc-0.14.0.2.2.6.0-2800-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
OK
Time taken: 2
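
The date window in the query above can also be derived from a start date and a period instead of being typed by hand. The following is a minimal sketch, assuming the `hour` partition uses the `YYYY-MM-DD_HH` format seen in the query. The helper name `build_fct_query` and its parameters are hypothetical and not part of the original notebook.

```python
from datetime import date, timedelta


def build_fct_query(start, period_days, station="FCT", model="N66"):
    """Build the Hive query for test logs from `start` to `start + period_days`.

    Hypothetical helper: the table and column names are copied from the
    query above; the function itself is only an illustration.
    """
    end = start + timedelta(days=period_days)
    return (
        "use cpk; select * from mlb_test_log_detail "
        "where station = '{station}' "
        "and model = '{model}' "
        "and hour between '{first}_00' and '{last}_23';"
    ).format(station=station, model=model,
             first=start.isoformat(), last=end.isoformat())


# date = 2015-10-26, period = 6 days (a week), as in the heading above
query = build_fct_query(date(2015, 10, 26), 6)
print(query)
```

The resulting string can be passed to `hive -e` in the same way as the hand-written query.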

#### b. Download the **Bobcat** data with the same starting date, covering a period of 2 weeks.

In [47]:
!ssh mlb@10.195.223.53 "hive -e \"use cpk; select * from mlb_bobcat_raw \
                        where model = 'Agera'\
                        and day between '2015-10-26' and '2015-11-07';\"" \
                        > Data/Bobcat_20151026.log

15/11/24 14:44:35 WARN conf.HiveConf: HiveConf of name hive.optimize.mapjoin.mapreduce does not exist
15/11/24 14:44:35 WARN conf.HiveConf: HiveConf of name hive.heapsize does not exist
15/11/24 14:44:35 WARN conf.HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
15/11/24 14:44:35 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.noconditionaltask does not exist

Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.6.0-2800/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.6.0-2800/hive/lib/hive-jdbc-0.14.0.2.2.6.0-2800-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
OK
Time taken: 2

#### c. Download the RPC data with the same starting date but a period of 4 weeks

In [49]:
!ssh mlb@zz2 "hive -e \"use cpk; select * from rpc_file\
                        where day between '2015-10-26' and '2015-11-21';\"" \
                        > Data/rpc_20151026.log


Logging initialized using configuration in file:/etc/hive/conf/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.8.0-3150/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.2.8.0-3150/hive/lib/hive-jdbc-0.14.0.2.2.8.0-3150-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
OK
Time taken: 11.29 seconds
OK
Time taken: 2.042 seconds, Fetched: 8804653 row(s)


### Step 2. Transform downloaded data to DataFrames

In the following steps, we use Spark SQL DataFrames to preprocess the data.

In [2]:
#Import necessary libraries

import findspark
findspark.init('/Users/hadoop1/srv/spark')
import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext, Row
import pandas as pd
sc = pyspark.SparkContext()
hc = HiveContext(sc)

**a. FCT**



In [2]:
fctLines = sc.textFile("Data/FCT_20151026.log")

In [3]:
fctParts = fctLines.map(lambda l: l.split("\t"))

In [4]:
fctBoard = fctParts.map(lambda p: Row(serial_number=p[0], test_result=p[1], fct_test_time=p[2],\
                                     version=p[3],line=p[4],machine=p[5],\
                                     slot=p[7],items=p[8],model=p[10],station=p[11]))

In [5]:
fctBoardDf = hc.createDataFrame(fctBoard)

In [6]:
fctBoardDf.count()

326511

In [6]:
fctBoardDf.show()

+-------------------+--------------------+-----------+-------+-----+----------------+----+-------+-----------+--------------------+
|      fct_test_time|               items|       line|machine|model|   serial_number|slot|station|test_result|             version|
+-------------------+--------------------+-----------+-------+-----+----------------+----+-------+-----------+--------------------+
|2015-10-26 00:03:08|{"VREF_180mV_CHEC...|L06-2FRF-01|      9|  N66|F3Y54361944G360B|   1|    FCT|       PASS|7.26q_Agera_PVT_T...|
|2015-10-26 00:03:08|{"VREF_180mV_CHEC...|L06-2FRF-01|      9|  N66|F3Y5436188QG360B|   2|    FCT|       PASS|7.26q_Agera_PVT_T...|
|2015-10-26 00:03:08|{"VREF_180mV_CHEC...|L06-2FRF-01|      9|  N66|F3Y543619VYG360B|   3|    FCT|       PASS|7.26q_Agera_PVT_T...|
|2015-10-26 00:03:08|{"VREF_180mV_CHEC...|L06-2FRF-01|      9|  N66|F3Y543619W0G360B|   4|    FCT|       PASS|7.26q_Agera_PVT_T...|
|2015-10-26 00:00:23|{"VREF_180mV_CHEC...|L06-2FRF-01|     10|  N66|F3Y54361

**b. Bobcat**

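| | Name | Type | Comment |
|---|---|---|---|
| 0 | wip_no | string | |
| 1 | test_time | string | |
| 2 | test_hour | string | |
| 3 | is_test_fail | string | |
| 4 | symptom_code | string | |
| 5 | symptom_code_first | string | |
| 6 | factory | string | |
| 7 | station | string | |
| 8 | station_code | string | |
| 9 | line | string | |
| 10 | machine | string | |
| 11 | line_type | string | |
| 12 | test_times | int | 3 |
| 13 | rankno | int | |
| 14 | fail_count | int | |
| 15 | test_result | string | |
| 16 | symptom | string | |
| 17 | day | string | |
| 18 | model | string | |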
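
Because the Bobcat log is a tab-separated dump without a header row, fields have to be picked by position. The following is a minimal sketch, assuming the column order listed above is also the order in the file: a name-to-index lookup makes the positions explicit. The names `BOBCAT_COLUMNS`, `BOBCAT_INDEX` and `pick_fields` are illustrative only.

```python
# Column order copied from the schema listing above.
BOBCAT_COLUMNS = [
    "wip_no", "test_time", "test_hour", "is_test_fail", "symptom_code",
    "symptom_code_first", "factory", "station", "station_code", "line",
    "machine", "line_type", "test_times", "rankno", "fail_count",
    "test_result", "symptom", "day", "model",
]
BOBCAT_INDEX = {name: i for i, name in enumerate(BOBCAT_COLUMNS)}


def pick_fields(line, names):
    """Split one tab-separated line and return the requested fields as a dict.

    Returns None when the line has fewer fields than the schema, so that
    malformed lines can be filtered out instead of raising IndexError.
    """
    parts = line.split("\t")
    if len(parts) < len(BOBCAT_COLUMNS):
        return None
    return {name: parts[BOBCAT_INDEX[name]] for name in names}


# Synthetic example line: every field simply holds its own column name.
sample = "\t".join(BOBCAT_COLUMNS)
print(pick_fields(sample, ["wip_no", "station", "test_result"]))
print(BOBCAT_INDEX["symptom_code"], BOBCAT_INDEX["symptom"])
```

Note that the listing contains both `symptom_code` (index 4) and `symptom` (index 16), so it is worth double-checking which of the two a given positional index refers to.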

In [7]:
bobcatLines = sc.textFile('Data/Bobcat_20151026.log')

In [8]:
bobcatParts = bobcatLines.map(lambda l: l.split('\t'))

In [9]:
bobcatRows = bobcatParts.map(lambda p: Row(sympton = p[4], serial_num = p[0],test_time=p[1],station=p[7],test_result=p[15]))

In [10]:
bobcatDf = hc.createDataFrame(bobcatRows)

In [11]:
bobcatDf.show()

+----------------+------------------+--------------------+-----------+-------------------+
|      serial_num|           station|             sympton|test_result|          test_time|
+----------------+------------------+--------------------+-----------+-------------------+
|F3Y54411DDKG35WB|          CELL-CAL|                    | First Pass|2015-10-26 18:15:57|
|F3Y54411XJWG360B|GATEKEEPER-PREBURN|                    | First Pass|2015-10-26 20:51:26|
|F3Y54411DDEG360B|          CELL-CAL|                    | First Pass|2015-10-26 13:31:30|
|F3Y54411DCZG35WB|               FCT|                    | First Pass|2015-10-26 18:37:43|
|F3Y54412JPBG35WB|     DFU-NAND-INIT|                    | First Pass|2015-10-26 18:09:08|
|F3Y54411DCTG35WB|GATEKEEPER-PREBURN|                    |Retest Pass|2015-10-26 22:31:09|
|F3Y54411DCTG35WB|GATEKEEPER-PREBURN|CB Error; SMT QT ...|Retest Pass|2015-10-26 21:09:53|
|F3Y54411DCNG360B|GATEKEEPER-PREBURN|                    | First Pass|2015-10-26 20:42:48|

**c. RPC Data**


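| | Name | Type | Comment |
|---|---|---|---|
| 0 | namec | string | |
| 1 | station_code | string | |
| 2 | serial_num | string | |
| 3 | add_date | string | |
| 4 | emp | string | |
| 5 | station_type | string | |
| 6 | fail_location | string | |
| 7 | code | string | |
| 8 | desce | string | |
| 9 | other | string | |
| 10 | day | string | |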

In [13]:
rpcLines = sc.textFile('Data/rpc_20151026.log')

In [14]:
rpcParts = rpcLines.map(lambda l: l.split('\t'))

In [15]:
rpcRows = rpcParts.map(lambda p: Row(namec = p[0],serial_num=p[2], add_date = p[3], emp = p[4],\
                                     fail_location=p[6], code = p[7], desce = p[8], day = p[10],\
                                     ))

In [16]:
rpcDf = hc.createDataFrame(rpcRows)

In [17]:
rpcDf.show()

+-------------------+----+----------+-----+---+-------------+---------+----------------+
|           add_date|code|       day|desce|emp|fail_location|    namec|      serial_num|
+-------------------+----+----------+-----+---+-------------+---------+----------------+
|2015-10-26 00:00:20|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y54331AFQG360B|
|2015-10-26 00:00:21|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y542601WQGKL0B|
|2015-10-26 00:00:21|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y543209DPGKL0B|
|2015-10-26 00:00:22|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y543302J0GKKYB|
|2015-10-26 00:00:22|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y543505FQGKKYB|
|2015-10-26 00:00:23|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y54312XXLG35WB|
|2015-10-26 00:00:24|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y543300Z7GKKYB|
|2015-10-26 00:00:24|    |2015-10-26|     |徐騰飛|             |Check Out|F3Y54310JV4GKL0B|
|2015-10-26 00:00:25|

### Step 3. Join and filter data (X)
Now we have 3 DataFrames: fctBoardDf, bobcatDf, and rpcDf.

In [18]:
fctBoardDf

DataFrame[fct_test_time: string, items: string, line: string, machine: string, model: string, serial_number: string, slot: string, station: string, test_result: string, version: string]

In [19]:
bobcatDf

DataFrame[serial_num: string, station: string, sympton: string, test_result: string, test_time: string]

In [20]:
rpcDf

DataFrame[add_date: string, code: string, day: string, desce: string, emp: string, fail_location: string, namec: string, serial_num: string]

Each of them is also registered as a temp table so that it can be queried with SQL:

In [21]:
fctBoardDf.registerTempTable("fctBoardDfTemp")

In [22]:
bobcatDf.registerTempTable("bobcatDfTemp")

In [23]:
rpcDf.registerTempTable("rpcDfTemp")

#### a. Join the FCT (['test_result'] == 'PASS') and Bobcat GATEKEEPER (['test_result'] == 'First Pass' | ['test_result'] == 'Retest Pass') DataFrames on serial_num

First, filter fctBoardDf to keep only the PASS results, and verify that the PASS and FAIL counts add up to the total number of records.

In [24]:
fctBoardPassDf = fctBoardDf.filter(fctBoardDf.test_result == 'PASS')

In [25]:
fctBoardFailDf = fctBoardDf.filter(fctBoardDf.test_result == 'FAIL')

In [70]:
322198+4313

326511
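
The addition above checks that the two filtered counts add up to the total number of FCT records (322198 + 4313 = 326511), i.e. that no record carries a third `test_result` value. The same check can be expressed generically. The sketch below uses a small made-up list instead of the real data.

```python
from collections import Counter

# Made-up test results standing in for the test_result column.
results = ["PASS", "PASS", "FAIL", "PASS", "FAIL"]

counts = Counter(results)
unexpected = set(counts) - {"PASS", "FAIL"}

# Every record is either PASS or FAIL, so the two counts cover the total.
assert not unexpected
assert counts["PASS"] + counts["FAIL"] == len(results)

# The same identity with the counts reported in this notebook.
assert 322198 + 4313 == 326511
print(dict(counts))
```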

Filter bobcatDf to keep the serial numbers that passed the 'GATEKEEPER-PREBURN' station (bobcatGkPassDf). This filtered DataFrame then works as a mask for fctBoardPassDf, ensuring that every remaining serial number passed the last station.

In [26]:
bobcatGkPassDf = bobcatDf.filter((bobcatDf.station == 'GATEKEEPER-PREBURN'))\
                         .filter((bobcatDf.test_result == 'First Pass') | 
                                   (bobcatDf.test_result == 'Retest Pass'))

In [27]:
fctBoardPassDf.registerTempTable("fctBoardPassDfTemp")

In [28]:
bobcatGkPassDf.registerTempTable("bobcatGKPassDfTemp")

In [29]:
fctGateKeeper = fctBoardPassDf.join(bobcatGkPassDf, fctBoardPassDf.serial_number == bobcatGkPassDf.serial_num)

In [40]:
fctGateKeeper.count()

250232
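
An inner join returns one output row per matching pair, so a serial number with several passing GATEKEEPER-PREBURN records (the Bobcat sample output in Step 2 shows F3Y54411DCTG35WB twice with 'Retest Pass') yields several copies of the same FCT row. If the GateKeeper data is only meant to act as a mask, a set-membership test keeps each FCT row at most once. The sketch below contrasts the two on made-up records in plain Python. It illustrates the idea and is not the notebook's method.

```python
# Made-up records: (serial_number, fct_test_time) and (serial_num, test_result).
fct_pass = [("SN1", "2015-10-26 00:03:08"), ("SN2", "2015-10-26 00:05:10")]
gk_pass = [("SN1", "Retest Pass"), ("SN1", "Retest Pass"), ("SN3", "First Pass")]

# Inner join: one output row per matching (FCT row, GateKeeper row) pair.
inner = [(f, g) for f in fct_pass for g in gk_pass if f[0] == g[0]]

# Mask (semi join): keep an FCT row if its serial number appears at all.
gk_serials = {serial for serial, _ in gk_pass}
masked = [f for f in fct_pass if f[0] in gk_serials]

print(len(inner))   # 2: SN1 matches two GateKeeper rows
print(len(masked))  # 1: SN1 is kept once
```

In Spark SQL the mask corresponds to a `left semi join`, which this notebook uses later for the RPC data.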

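fctGateKeeper is a DataFrame containing the records that passed both the FCT and GateKeeper test stations.

#### b. Join the fctGateKeeper DataFrame with the bobcatFctFirstPassDf DataFrame on the serial number and test time columns.

First, filter bobcatDf down to the FCT station records with a 'First Pass' result. Then join fctGateKeeper with this filtered DataFrame, matching on both the serial number and the test time, so that only the FCT records flagged as 'First Pass' remain.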

In [30]:
bobcatFctFirstPassDf = bobcatDf.filter(bobcatDf.station == 'FCT')\
                           .filter(bobcatDf.test_result == 'First Pass')

In [31]:
fctGateKeeper.registerTempTable("fctGateKeeperTemp")

In [32]:
bobcatFctFirstPassDf.registerTempTable("bobcatFctFirstPassDfTemp")

In [33]:
fctGateKeeperFirstPassDf = hc.sql("select F.serial_number, F.line, F.machine,\
                                      F.model, F.slot, F.fct_test_time, F.items, B.test_result\
                                      from fctGateKeeperTemp F\
                                      inner join bobcatFctFirstPassDfTemp B\
                                      on F.serial_number =\
                                      B.serial_num and F.fct_test_time = B.test_time")

In [34]:
fctGateKeeperFirstPassDf.registerTempTable("fctGateKeeperFirstPassDfTemp")

In [49]:
fctGateKeeperFirstPassDf.count()

227232

#### Add a new column specifying the number of test items

The functions module should be imported from pyspark.sql; then we can use functions.udf to define user-defined functions (UDFs).

In [35]:
from pyspark.sql import functions

When defining a user-defined function, the type of the returned value must be specified beforehand, so IntegerType should also be imported.

In [36]:
from pyspark.sql.types import IntegerType

In [37]:
import ast

In [39]:
sparkItemToNum = functions.udf(lambda items: len(ast.literal_eval(items)), IntegerType())

In [40]:
fctGateKeeperFirstPassItemnumDf = fctGateKeeperFirstPassDf.withColumn('item_num',\
                                  sparkItemToNum(fctGateKeeperFirstPassDf.items))

In [41]:
fctGateKeeperFirstPass799Df = fctGateKeeperFirstPassItemnumDf.filter\
                        (fctGateKeeperFirstPassItemnumDf.item_num == 799)
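
The UDF above wraps a plain Python expression, which can be tested without Spark. The following is a minimal sketch, assuming `items` is the string form of a dictionary as shown in the FCT output. The function name `count_items`, the made-up item names, and the choice to return -1 for unparsable strings are assumptions made here, not part of the notebook.

```python
import ast

EXPECTED_ITEMS = 799  # number of test items required by the filter above


def count_items(items):
    """Return the number of test items in `items`, or -1 if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(items)
    except (ValueError, SyntaxError):
        return -1
    return len(parsed) if isinstance(parsed, dict) else -1


# Made-up item names and values for illustration.
print(count_items('{"ITEM_A": "0.18", "ITEM_B": "0.09"}'))
print(count_items('{"ITEM_A": "0.18", "ITEM'))  # truncated string
print(count_items('{"ITEM_A": "0.18"}') == EXPECTED_ITEMS)
```

Such a function could then be registered in the same way as the lambda, e.g. `functions.udf(count_items, IntegerType())`. Records that fail to parse would never equal 799 and would be dropped by the filter instead of failing the job.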

In [42]:
fctGateKeeperFirstPass799Df.registerTempTable("fctGateKeeperFirstPass799DfTemp")

In [43]:
rpcDf.registerTempTable("rpcDfTemp")

In [44]:
fctGatekeeperAllpass799 = hc.sql("select * from fctGateKeeperFirstPass799DfTemp F\
                               left outer join rpcDfTemp R\
                               on F.serial_number = R.serial_num\
                               where R.serial_num is null")

In [46]:
fctGatekeeperCRBFail799 = hc.sql("select * from fctGateKeeperFirstPass799DfTemp F\
                            left semi join rpcDfTemp R\
                            on F.serial_number = R.serial_num\
                            and R.namec = 'CRB Check In'")
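
The two queries above split the boards by their RPC history. The left outer join with `R.serial_num is null` keeps boards with no RPC record at all (an anti join), and the left semi join keeps boards with at least one 'CRB Check In' record. Boards whose RPC records contain no 'CRB Check In' fall into neither group. The sketch below reproduces that logic on made-up records in plain Python.

```python
# Made-up data: serial numbers of boards, and RPC records as (serial_num, namec).
boards = ["SN1", "SN2", "SN3", "SN4"]
rpc = [("SN2", "CRB Check In"), ("SN2", "Check Out"), ("SN3", "TFB Check In")]

rpc_serials = {serial for serial, _ in rpc}
crb_serials = {serial for serial, namec in rpc if namec == "CRB Check In"}
other_serials = rpc_serials - crb_serials

all_pass = [sn for sn in boards if sn not in rpc_serials]  # anti join
crb_fail = [sn for sn in boards if sn in crb_serials]      # semi join
neither = [sn for sn in boards if sn in other_serials]     # in RPC, but no CRB Check In

print(all_pass)  # ['SN1', 'SN4']
print(crb_fail)  # ['SN2']
print(neither)   # ['SN3']
```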

### Step 4. Filter RPC data (y)

#### a. Filter the RPC DataFrame by the ['namec'] column, separating ['namec'] == 'TFB Check In' from ['namec'] == 'CRB Check In'. Only the 'CRB Check In' records are needed.

#### b. Filter RPC records by 'NTF' and 'Replaced'.

#### c. Filter 'Replaced' records by 'AP' and 'RF'

Note: AP = Application, RF = Radio Frequency.
      FCT test items are mainly for AP.

#### d. Filter RPC DataFrame by ['serial_num'].isin(above DataFrame['serial_num'])

#### e. Join above DataFrame with RPC DataFrame to identify FATP return records. 

### Step 5. Extract FCT test values (FCT['items']) and store it as a separate DataFrame

#### a. The FCT test values are stored as a key-value map within the FCT['items'] column. Extract these values and store them as a separate DataFrame. The remaining columns of the original DataFrame are stored as a metadata DataFrame.

**At this point, we have joined and generated 2 dataframes:**
- fctGatekeeperAllpass(223041)
- fctGatekeeperCRBFail(3126)

Both DataFrames have a column called 'items' that contains the FCT test log values, which will be used for building the anomaly detection model. We first extract the FCT test log values from these 2 DataFrames and store them as separate DataFrames.

In [47]:
import ast

In [48]:
def dic_to_row(record):
    schema = {'{i:s}'.format(i = key):record[key] for key in record}
    return Row(**schema)

In [49]:
itemsCRBFailRow = fctGatekeeperCRBFail799.map(lambda row: row.items)\
                                .map(lambda s: ast.literal_eval(s))\
                                .map(lambda d: dic_to_row(d))

In [50]:
itemsCRBFailRow.take(1)

[Row(BATT_VCC_SHORT_CHECK='0.096597', CT_BT_delta='6.375548999999921', CT_BT_off='275.594482', CT_BT_on='277.5264279999999', CT_Phosphorus_baseline='230.7730409999999', CT_Phosphorus_delta='1.545547000000027', CT_Phosphorus_sampling='232.318588', CT_acc_load_delta='112.5367739999999', CT_acc_load_off='230.483246', CT_acc_load_on='343.0200199999999', CT_accel_delta='0.965972999999991', CT_accel_off='230.676437', CT_accel_sampling='231.64241', CT_arc_boost_8V_delta='79.59686199999998', CT_arc_boost_8V_on='342.247253', CT_arc_delta='32.07054200000001', CT_arc_off='230.5798489999999', CT_arc_on='262.650391', CT_bl_baseline='230.0968629999999', CT_bl_high_22000='593.691956', CT_bl_high_22000_delta='363.595093', CT_bl_low_5500='318.5805969999999', CT_bl_low_5500_delta='88.48373399999996', CT_bl_med_12600='434.3050539999999', CT_bl_med_12600_delta='204.2081909999999', CT_bl_off='230.2900539999999', CT_bl_off_delta='0.1931909999999845', CT_buck6_off='230.386658', CT_buck6_pfm='230.483246', CT_
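
The values in the row above are still strings, so they need to be converted to numbers before any scaling. The following is a minimal sketch, assuming each value is meant to be a decimal number and that unparsable values should become missing (None). The helper `items_to_floats` and the malformed sample entry are hypothetical and not part of the notebook.

```python
import ast


def items_to_floats(items):
    """Parse an items string into a dict mapping item name to float or None."""
    record = ast.literal_eval(items)
    values = {}
    for key, raw in record.items():
        try:
            values[str(key)] = float(raw)
        except (TypeError, ValueError):
            values[str(key)] = None  # treated as a missing value later on
    return values


# Two values copied from the output above, plus one made-up malformed entry.
sample = ('{"BATT_VCC_SHORT_CHECK": "0.096597", '
          '"CT_BT_off": "275.594482", "MADE_UP_ITEM": "1.2e-"}')
parsed = items_to_floats(sample)
print(parsed)
```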

In [53]:
itemsCRBFailDf = hc.createDataFrame(itemsCRBFailRow)

In [71]:
itemsCRBFailDf.take(1)

[Row(BATT_VCC_SHORT_CHECK=u'0.096597', CT_BT_delta=u'6.375548999999921', CT_BT_off=u'275.594482', CT_BT_on=u'277.5264279999999', CT_Phosphorus_baseline=u'230.7730409999999', CT_Phosphorus_delta=u'1.545547000000027', CT_Phosphorus_sampling=u'232.318588', CT_acc_load_delta=u'112.5367739999999', CT_acc_load_off=u'230.483246', CT_acc_load_on=u'343.0200199999999', CT_accel_delta=u'0.965972999999991', CT_accel_off=u'230.676437', CT_accel_sampling=u'231.64241', CT_arc_boost_8V_delta=u'79.59686199999998', CT_arc_boost_8V_on=u'342.247253', CT_arc_delta=u'32.07054200000001', CT_arc_off=u'230.5798489999999', CT_arc_on=u'262.650391', CT_bl_baseline=u'230.0968629999999', CT_bl_high_22000=u'593.691956', CT_bl_high_22000_delta=u'363.595093', CT_bl_low_5500=u'318.5805969999999', CT_bl_low_5500_delta=u'88.48373399999996', CT_bl_med_12600=u'434.3050539999999', CT_bl_med_12600_delta=u'204.2081909999999', CT_bl_off=u'230.2900539999999', CT_bl_off_delta=u'0.1931909999999845', CT_buck6_off=u'230.386658', CT

In [349]:
itemsCRBFailDf.count()

3126

In [54]:
itemsAllpassRow = fctGatekeeperAllpass799.map(lambda row: row.items)\
                                      .map(lambda items: ast.literal_eval(items))\
                                      .map(lambda dic: dic_to_row(dic))

In [55]:
itemsAllpassDf = hc.createDataFrame(itemsAllpassRow)

In [352]:
itemsAllpassDf.count()

223041

In [75]:
itemsAllpassDf.take(1)

[Row(BATT_VCC_SHORT_CHECK=u'0.097813', CT_BT_delta=u'6.162506000000007', CT_BT_off=u'289.7260739999999', CT_BT_on=u'290.802002', CT_Phosphorus_baseline=u'242.383911', CT_Phosphorus_delta=u'1.662826999999993', CT_Phosphorus_sampling=u'244.046738', CT_acc_load_delta=u'114.6381229999999', CT_acc_load_off=u'242.2861019999999', CT_acc_load_on=u'356.9242249999999', CT_accel_delta=u'0.8803100000000086', CT_accel_off=u'242.481735', CT_accel_sampling=u'243.362045', CT_arc_boost_8V_delta=u'79.81661999999994', CT_arc_boost_8V_on=u'354.8702999999999', CT_arc_delta=u'32.66976900000003', CT_arc_off=u'242.383911', CT_arc_on=u'275.05368', CT_bl_baseline=u'241.797043', CT_bl_high_22000=u'605.17688', CT_bl_high_22000_delta=u'363.379837', CT_bl_low_5500=u'329.7319639999999', CT_bl_low_5500_delta=u'87.93492099999994', CT_bl_med_12600=u'445.3485109999999', CT_bl_med_12600_delta=u'203.5514679999999', CT_bl_off=u'242.188293', CT_bl_off_delta=u'0.3912500000000136', CT_buck6_off=u'242.2861019999999', CT_buck6_

In [76]:
itemsAllpassDf.show()

+--------------------+-----------------+-----------------+-----------------+----------------------+-------------------+----------------------+-----------------+-----------------+-----------------+------------------+-----------------+-----------------+---------------------+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------+-----------------+--------------------+-----------------+---------------------+-----------------+-------------------+-----------------+-----------------+-------------------+-----------------+------------------+-----------------+-------------------+-----------------+----------------------+-----------------+-----------------------+-----------------+-----------------+------------------+-----------------+-------------------+-----------------+-----------------+--------------------+-----------------+--------------------+-----------------+--------------------+-----------------+------------------

In [78]:
itemsAllpassPdf = fctGatekeeperCRBFail799.toPandas()

Traceback (most recent call last):
  File "/Users/hadoop1/anaconda/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/Users/hadoop1/anaconda/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/Users/hadoop1/anaconda/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/Users/hadoop1/anaconda/lib/python2.7/SocketServer.py", line 655, in __init__
    self.handle()
  File "/Users/hadoop1/srv/spark/python/pyspark/accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "/Users/hadoop1/srv/spark/python/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File "/Users/hadoop1/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 49830)
----------------------------------------


Py4JNetworkError: An error occurred while trying to connect to the Java server

**At this point, the test logs have been separated into their own DataFrames, so we now have 4 DataFrames:**

(before excluding serial numbers with fewer than 799 items)
1. fctGatekeeperAllpass (223041)
2. itemsAllpassDf (223041) 
3. fctGatekeeperCRBFail (3126)
4. itemsCRBFailDf (3126)

(after excluding serial numbers with fewer than 799 items)
1. fctGatekeeperAllpass799 (222836)
2. itemsAllpassDf (222836)
3. fctGatekeeperCRBFail799 (3117)
4. itemsCRBFailDf (3117)

#### b. Examine the extracted FCT test value DataFrame to make sure all records have the same length.

In [357]:
len(itemsAllpassDf.take(1)[0])

799

In [362]:
itemNumAllpass = itemsAllpassDf.map(lambda row: len(row))\
                               .filter(lambda a: a<799)

In [365]:
itemNumAllpass.count()

205

#### c. Drop records with fewer or missing test items. If any records are dropped, remove the corresponding records (by record index) from the other DataFrame as well, so that the test item value DataFrame and the metadata DataFrame keep the same number of records.
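
One way to keep the two DataFrames aligned is to decide once which records survive and apply the same decision to both. The sketch below shows this on made-up paired lists in plain Python, with an expected length of 3 standing in for 799. As a consistency check on the counts reported above, 223041 - 205 = 222836.

```python
EXPECTED = 3  # stands in for the 799 items expected per record

# Made-up paired data: metadata[i] describes items[i].
metadata = [{"serial_number": "SN1"}, {"serial_number": "SN2"}, {"serial_number": "SN3"}]
items = [
    {"a": "1.0", "b": "2.0", "c": "3.0"},
    {"a": "1.5", "b": "2.5"},              # one item missing
    {"a": "0.9", "b": "2.1", "c": "3.3"},
]

# Decide once which record indices survive, then apply it to both lists.
keep = [i for i, rec in enumerate(items) if len(rec) == EXPECTED]
items_kept = [items[i] for i in keep]
metadata_kept = [metadata[i] for i in keep]

assert len(items_kept) == len(metadata_kept)
print([m["serial_number"] for m in metadata_kept])  # ['SN1', 'SN3']
```

In the notebook, filtering on item_num == 799 before the items are extracted has the same effect, because both DataFrames are then derived from the already filtered records.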

### Step 6. Missing value handling before data scaling

#### a. Examine missing values.
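
A simple way to examine missing values is to count them per test item. The sketch below does this in plain Python on made-up records, treating None and empty strings as missing. Which values actually count as missing in the FCT data is an assumption here.

```python
from collections import Counter

# Made-up records; None and '' stand for missing values.
records = [
    {"item_a": "0.5", "item_b": "1.0"},
    {"item_a": None, "item_b": "1.2"},
    {"item_a": "", "item_b": "0.9"},
]


def is_missing(value):
    """Return True for values treated as missing in this sketch."""
    return value is None or value == ""


missing = Counter()
for rec in records:
    for name, value in rec.items():
        if is_missing(value):
            missing[name] += 1

print(dict(missing))  # {'item_a': 2}
```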

### Step 7. Data scaling (normalization, min-max scaling)

#### a. Rescale the FCT test log DataFrame with normalization and min-max scaling.
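
For reference, min-max scaling maps a column to the range [0, 1] via x' = (x - min) / (max - min), while standardization (one common meaning of "normalization") uses z = (x - mean) / std. Which variant is meant by "normalization" here is not stated, so the sketch below is an assumption. It uses plain Python on a made-up column and leaves missing values (None) untouched.

```python
def min_max_scale(column):
    """Scale the non-missing values of `column` to the range [0, 1]."""
    present = [x for x in column if x is not None]
    if not present:
        return list(column)  # nothing to scale
    lo, hi = min(present), max(present)
    span = hi - lo
    # A constant column has zero span; map its values to 0.0 to avoid dividing by zero.
    return [None if x is None else ((x - lo) / span if span else 0.0) for x in column]


def standardize(column):
    """Shift and scale the non-missing values to zero mean and unit (population) std."""
    present = [x for x in column if x is not None]
    if not present:
        return list(column)  # nothing to scale
    mean = sum(present) / len(present)
    std = (sum((x - mean) ** 2 for x in present) / len(present)) ** 0.5
    return [None if x is None else ((x - mean) / std if std else 0.0) for x in column]


column = [230.0, None, 240.0, 250.0]
print(min_max_scale(column))  # [0.0, None, 0.5, 1.0]
print(standardize(column))
```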

### Step 8. Missing value handling after data scaling

#### a. Replace missing values with column.min()

Note: Most of the observed missing values were due to incorrect scientific notation. When a number is too small, its scientific notation is rendered incorrectly, so we fill the missing values with the min() value of each column.
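
Filling with the column minimum can be sketched as follows, in plain Python on a made-up column where None marks a missing value. The function name `fill_missing_with_min` is illustrative only. Note that after min-max scaling the column minimum is 0, so in that case this amounts to filling with 0.

```python
def fill_missing_with_min(column):
    """Replace None entries with the minimum of the non-missing values."""
    present = [x for x in column if x is not None]
    if not present:
        return list(column)  # nothing to infer a minimum from
    col_min = min(present)
    return [col_min if x is None else x for x in column]


scaled = [0.0, None, 0.5, 1.0]
print(fill_missing_with_min(scaled))  # [0.0, 0.0, 0.5, 1.0]
```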