# Explore CM1 Dataset

## Details about CM1

CM1 is the software of a NASA spacecraft instrument (data collection and processing), written in C. At various times, researchers have negotiated access to the CM1 source code.

## Some guidelines

1. Look at the raw data
1. Load dataset
1. Summarize the dataset 
1. Correlation between attributes
1. Skewness of univariate distributions
1. Data visualization

### Look at the raw data 

In [None]:
!head -5 NASADefectDataset/OriginalData/csv/cm1.csv

### Load dataset

In [None]:
import pandas as pd

filename = "cm1.csv"
relativepath = 'NASADefectDataset/OriginalData/csv/'

# Read only the header row to get the column names
colnames = pd.read_csv(relativepath + filename, nrows=0).columns

# Limit how many columns pandas shows in its output
pd.set_option('display.max_columns', 8)
print(colnames.tolist())

# Read everything except the 'id' column
data = pd.read_csv(relativepath + filename, usecols=[c for c in colnames if c != 'id'])
peek = data.head(5)
peek

### Summarize the dataset

1. Dimension
1. Data type
1. Missing values in the attributes
1. Descriptive statistics
1. Number of instances for the Defective attribute

In [None]:
# Dimension
# Get how many instances (rows) and how many attributes (columns) are contained in the data
print(data.shape)

In [None]:
# Data Type
print(data.dtypes)

In [44]:
# Identify missing values in any column
# info() prints its report directly and returns None, so no print() is needed
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 41 columns):
LOC_BLANK                          505 non-null int64
BRANCH_COUNT                       505 non-null int64
CALL_PAIRS                         505 non-null int64
LOC_CODE_AND_COMMENT               505 non-null int64
LOC_COMMENTS                       505 non-null int64
CONDITION_COUNT                    505 non-null int64
CYCLOMATIC_COMPLEXITY              505 non-null int64
CYCLOMATIC_DENSITY                 505 non-null float64
DECISION_COUNT                     505 non-null int64
DECISION_DENSITY                   505 non-null object
DESIGN_COMPLEXITY                  505 non-null int64
DESIGN_DENSITY                     505 non-null float64
EDGE_COUNT                         505 non-null int64
ESSENTIAL_COMPLEXITY               505 non-null int64
ESSENTIAL_DENSITY                  505 non-null float64
LOC_EXECUTABLE                     505 non-null int64
PARAMETER_COUNT         
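
`info()` reports non-null counts, but a direct per-column count of missing values can be easier to scan. The listing above also shows `DECISION_DENSITY` stored as `object`, which suggests it holds at least one non-numeric entry. The following is a minimal sketch on a small made-up frame; the values and the `'?'` placeholder are assumptions for illustration and are not taken from cm1.csv.

```python
import pandas as pd

# Hypothetical stand-in for the CM1 frame; values are invented for illustration.
toy = pd.DataFrame({
    "LOC_BLANK": [3, 0, 7, 1],
    "DECISION_DENSITY": ["2.0", "?", "1.5", "2.0"],  # '?' is an assumed placeholder
})

# Count missing values per column (0 everywhere means no NaN cells).
missing_before = toy.isnull().sum()

# Convert the object column to numbers; unparseable entries become NaN.
toy["DECISION_DENSITY"] = pd.to_numeric(toy["DECISION_DENSITY"], errors="coerce")
missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
```

If the same conversion were applied to `data`, any entries that turn into NaN would then show up in the missing-value count and could be handled explicitly.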

In [None]:
# Descriptive Statistics
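# Limit how many columns pandas shows in its output
pd.set_option('display.max_columns', 4)
# Set the number of decimal places shown in the output
# (the full option name is needed; the short name 'precision' is ambiguous in newer pandas)
pd.set_option('display.precision', 3)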

# Look at a summary of each attribute
print(data.describe())

In [29]:
# Look at the number of instances (rows) in each class of the Defective attribute
print(data.groupby('Defective').size())

Defective
N    457
Y     48
dtype: int64
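
The counts above show that the classes are imbalanced: only about 9.5% of the modules are labelled defective. As a rough sketch, the relative frequencies can be computed with `value_counts(normalize=True)`. The series below is rebuilt from the printed counts rather than read from the file, and the name `labels` is chosen here for illustration.

```python
import pandas as pd

# Rebuild the label column from the counts printed above (457 N, 48 Y).
labels = pd.Series(["N"] * 457 + ["Y"] * 48, name="Defective")

# Relative frequency of each class.
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# Imbalance ratio: non-defective modules per defective module.
ratio = (labels == "N").sum() / (labels == "Y").sum()
print(round(ratio, 2))
```

On the real frame the equivalent call would be `data['Defective'].value_counts(normalize=True)`.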


### Correlation between attributes

1. Pearson
1. Spearman
1. Kendall

In [None]:
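# Limit how many columns pandas shows in its output
pd.set_option('display.max_columns', 6)
# Set the number of decimal places shown in the output
pd.set_option('display.precision', 3)
# Correlations are defined for numeric columns only, so drop the object columns first
numeric = data.select_dtypes(include='number')
pcorrelations = numeric.corr(method='pearson')
pcorrelations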

In [42]:
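# Restrict to numeric columns; newer pandas raises an error on object columns
scorrelations = data.select_dtypes(include='number').corr(method='spearman')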
scorrelations

| | LOC_BLANK | BRANCH_COUNT | CALL_PAIRS | ... | PATHOLOGICAL_COMPLEXITY | PERCENT_COMMENTS | LOC_TOTAL |
|---|---|---|---|---|---|---|---|
| LOC_BLANK | 1.0 | 0.689 | 0.665 | ... | NaN | 0.569 | 0.761 |
| BRANCH_COUNT | 0.689 | 1.0 | 0.702 | ... | NaN | 0.323 | 0.895 |
| CALL_PAIRS | 0.665 | 0.702 | 1.0 | ... | NaN | 0.386 | 0.754 |
| LOC_CODE_AND_COMMENT | 0.661 | 0.614 | 0.493 | ... | NaN | 0.646 | 0.646 |
| LOC_COMMENTS | 0.705 | 0.586 | 0.655 | ... | NaN | 0.81 | 0.663 |
| CONDITION_COUNT | 0.688 | 0.978 | 0.692 | ... | NaN | 0.334 | 0.871 |
| CYCLOMATIC_COMPLEXITY | 0.687 | 0.999 | 0.699 | ... | NaN | 0.319 | 0.896 |
| CYCLOMATIC_DENSITY | -0.175 | 0.162 | -0.14 | ... | NaN | -0.129 | -0.248 |
| DECISION_COUNT | 0.694 | 0.969 | 0.695 | ... | NaN | 0.338 | 0.87 |
| DESIGN_COMPLEXITY | 0.648 | 0.841 | 0.835 | ... | NaN | 0.27 | 0.81 |


In [None]:
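# Restrict to numeric columns; newer pandas raises an error on object columns
kcorrelations = data.select_dtypes(include='number').corr(method='kendall')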
kcorrelations
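
A full correlation matrix over some 40 attributes is hard to read at a glance. One possible way to surface near-duplicate attributes (for example `BRANCH_COUNT` and `CYCLOMATIC_COMPLEXITY`, at 0.999 in the Spearman output above) is to list only the pairs whose absolute correlation exceeds a threshold. The helper name `strong_pairs`, the 0.9 threshold, and the toy data are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.9, method="spearman"):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include="number").corr(method=method)
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    pairs = upper.stack()
    # Masked-out NaN cells, if still present, never pass this comparison.
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Invented toy data: 'b' increases monotonically with 'a', 'c' does not.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 11],
                    "c": [5, 1, 4, 2, 3]})
result = strong_pairs(toy)
print(result)
```

Calling `strong_pairs(data)` on the CM1 frame would give a short list of candidate redundant attributes to inspect before modelling.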

### Skewness of univariate distributions

In [46]:
# Skewness is defined for numeric columns only
skew = data.skew(numeric_only=True)
print(skew)

LOC_BLANK                           4.067
BRANCH_COUNT                        5.296
CALL_PAIRS                          2.558
LOC_CODE_AND_COMMENT                4.595
LOC_COMMENTS                        6.359
CONDITION_COUNT                     4.818
CYCLOMATIC_COMPLEXITY               5.764
CYCLOMATIC_DENSITY                  1.307
DECISION_COUNT                      4.723
DESIGN_COMPLEXITY                   5.316
DESIGN_DENSITY                     -0.936
EDGE_COUNT                          4.579
ESSENTIAL_COMPLEXITY                4.229
ESSENTIAL_DENSITY                   1.909
LOC_EXECUTABLE                      4.880
PARAMETER_COUNT                     2.975
GLOBAL_DATA_COMPLEXITY              0.000
GLOBAL_DATA_DENSITY                 0.000
HALSTEAD_CONTENT                    2.667
HALSTEAD_DIFFICULTY                 2.592
HALSTEAD_EFFORT                    11.320
HALSTEAD_ERROR_EST                  4.919
HALSTEAD_LENGTH                     4.007
HALSTEAD_LEVEL                    
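
Most of the skew values listed above are strongly positive, meaning the attributes have long right tails. A common remedy for non-negative, right-skewed counts is a log transform; whether it suits this dataset is a judgment call, and it does not apply to left-skewed attributes such as `DESIGN_DENSITY`. The sketch below uses invented numbers, not CM1 values, to show the effect of `np.log1p` on skewness.

```python
import numpy as np
import pandas as pd

# Invented right-skewed counts standing in for a metric such as LOC_COMMENTS.
toy = pd.Series([0, 1, 1, 2, 2, 3, 4, 6, 15, 80], name="toy_metric")

# log1p computes log(1 + x), so zero counts stay defined.
transformed = np.log1p(toy)

skew_before = toy.skew()
skew_after = transformed.skew()
print(skew_before, skew_after)
```

The transformed series should report a noticeably smaller skew than the raw one.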

### Data visualization

1. Box-and-whisker plots
1. Histograms
1. Scatter matrix

In [None]:
import matplotlib.pyplot as plt
# Univariate plots of each individual variable
# Box-and-whisker plots; figsize is passed directly because changing
# plt.rcParams after the plot call does not resize the existing figure
data.plot(kind='box', subplots=True, layout=(8,7), sharex=False, sharey=False, figsize=(16,16))
plt.show()

In [None]:
# Univariate plots of each individual variable
# Histograms; figsize is passed directly so it applies to this figure
data.hist(figsize=(16,16))
plt.show()

In [None]:
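# Multivariate plots to look at the interactions between the variables
# scatter_matrix lives in pd.plotting; the old pd.scatter_matrix alias was removed
pd.plotting.scatter_matrix(data, figsize=(16,16))
plt.show()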