# Lab 1 - Attributes and Visualization

Team: Frank Sclafani, Jan Shook, and Leticia Valadez

# Business Understanding

## Rubric (10 pts)

This initial phase focuses on understanding the project objective and requirement from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard can be used.

> Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.

## TV News Channel Commercial Detection

Our team selected this dataset for two reasons: 1) It has a large number of instances (129,685, which is greater than the requirement of at least 30,000) and enough attributes (14, which is greater than the requirement of at least 10), and 2) It looks like an interesting dataset (detecting commercials). Initial questions of interest are how do you detect commercials from this data? Can a model be trained to detect and skip (or remove) commercials? If so, would this solution be robust enough for commercial products like TiVo?

This dataset is from the UCI Machine Learning website (https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset). It consists of popular audio-visual features of video shots extracted from 150 hours of TV news broadcast of 3 Indian and 2 international news channels (30 Hours each). In the readme accompanying the data, the authors describe the potential benefits of this data as follows:

> Automatic identification of commercial blocks in news videos finds a lot of applications in the domain of television broadcast analysis and monitoring. Commercials occupy almost 40-60% of total air time. Manual segmentation of commercials from thousands of TV news channels is time consuming, and economically infeasible hence prompts the need for machine learning based Method. Classifying TV News commercials is a semantic video classification problem. TV News commercials on particular news channel are combinations of video shots uniquely characterized by audio-visual presentation. Hence various audio visual features extracted from video shots are widely used for TV commercial classification. Indian News channels do not follow any particular news presentation format, have large variability and dynamic nature presenting a challenging machine learning problem. Features from 150 Hours of broadcast news videos from 5 different (3 Indian and 2 International News channels) news channels. Viz. CNNIBN, NDTV 24X7, TIMESNOW, BBC and CNN are presented in this dataset. Videos are recorded at resolution of 720 X 576 at 25 fps using a DVR and set top box. 3 Indian channels are recorded concurrently while 2 International are recorded together. Feature file preserves the order of occurrence of shots.

Given this information, is the subset of Indian datasets really different from the international datasets? If so, can commercials still be identified from both Indian and international datasets the same way?


## About this Notebook

This Jupyter (v4.3.0) notebook was developed on Windows 10 Pro (64 bit) using Anaconda v4.4.7 and Python v3.*.

Packages associated with Anaconda were extracted as follows:

> conda install -c anaconda pandas

> conda install -c anaconda numpy 

In addition to the packages in Anaconda (and outside of the Anaconda ecosystem), this notebook uses Plotly (v2.2.3) for visualization. The zip file for Plotly can be found on GitHub at (https://github.com/plotly/plotly.py). You can install the Plotly packages as follows:

> pip install plotly

> pip install cufflinks

The version of Pandas and its dependencies are shown below.

In [1]:
import pandas as pd

%time pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Wall time: 3.87 s


# Data Understanding

## Rubric (80 pts)

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

> [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

> [15 points] Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods.

> [10 points] Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful.

> [15 points] Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate.  
 
> [15 points] Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.

> [10 points] Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

> [5 points] Are there other features that could be added to the data or created from existing features?  Which ones?

## Exceptional Work Rubric (10 pts)

> [10 points total] You have free rein to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.


## About this Dataset (Summary)

This project is comprised of five datasets (bbc.txt, cnn.txt, cnnibn.txt, ndtv.txt, and timesnow.txt), all found at the UCI Machine Learning website at https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset. Combined, these five datasets have 129,685 instances (rows) and 14 attributes. As shown in the example record below, most of these attributes have multiple data points (often hundreds) and almost all of these values are floating point.

> 1  1:123 2:1.316440 3:1.516003 4:5.605905 5:5.346760 6:0.013233 7:0.010729 8:0.091743 9:0.050768 10:3808.067871 11:702.992493 12:7533.133301 13:1390.499268 14:971.098511 15:1894.978027 16:114.965019 17:45.018257 18:0.635224 19:0.095226 20:0.063398 21:0.061210 22:0.038319 23:0.018285 24:0.011113 25:0.007736 26:0.004864 27:0.004220 28:0.003273 29:0.002699 30:0.002553 31:0.002323 32:0.002108 33:0.002036 34:0.001792 35:0.001553 36:0.001250 37:0.001317 38:0.001084 39:0.000818 40:0.000624 41:0.000586 42:0.000529 43:0.000426 44:0.000359 45:0.000446 46:0.000268 47:0.000221 48:0.000154 49:0.000217 50:0.000193 51:0.000163 52:0.000165 53:0.000210 54:0.000114 55:0.000130 56:0.000055 57:0.000013 58:0.733037 59:0.133122 60:0.041263 61:0.019699 62:0.010962 63:0.006927 64:0.004525 65:0.003128 66:0.002314 67:0.001762 68:0.001361 69:0.001065 70:0.000914 71:0.000777 72:0.000667 73:0.000565 74:0.000520 75:0.000467 76:0.000469 77:0.000486 78:0.000417 79:0.000427 80:0.000349 81:0.000258 82:0.000262 83:0.000344 84:0.000168 85:0.000163 86:0.001058 90:0.020584 91:0.185038 92:0.148316 93:0.047098 94:0.169797 95:0.061318 96:0.002200 97:0.010440 98:0.004463 100:0.010558 101:0.002067 102:0.338970 103:0.470364 104:0.189997 105:0.018296 106:0.126517 107:0.047620 108:0.045863 109:0.184865 110:0.095976 111:0.015295 112:0.056323 113:0.024587 115:0.037647 116:0.006015 117:0.160327 118:0.251688 119:0.176144 123:0.006356 219:0.002119 276:0.002119 296:0.341102 448:0.099576 491:0.069915 572:0.141949 573:0.103814 601:0.002119 623:0.050847 726:0.038136 762:0.036017 816:0.036017 871:0.016949 924:0.008475 959:0.036017 1002:0.006356 1016:0.008475 1048:0.002119 4124:0.422333825949 4125:0.663917631952

All five datasets are formatted in the svmlight / libsvm format. This is a text-based format with one sample per line. It is a "light" format in that it does not store zero-valued features: every feature that is "missing" from a line has a value of zero. The first element of each line stores the target variable, which in this dataset is the class label of the shot (the leading 1 in the example record above); the index:value pairs that follow hold the attributes described below.

Hence, the file simply contains more records like the one shown above. While there are only 14 attributes in each dataset, most attributes can have more than one column of data. 
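
To make the "missing means zero" rule concrete, here is a minimal sketch of how one such line could be expanded into a dense row. The helper name `parse_svmlight_line` and the shortened example line are made up for illustration, and treating indices as one-based is an assumption; this is not how the notebook itself loads the data.

```python
def parse_svmlight_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense_row).

    Assumption: indices in the file are one-based, so index 1 lands in
    dense_row[0]. Features absent from the line keep the value 0.0.
    """
    tokens = line.split()
    label = float(tokens[0])
    dense_row = [0.0] * n_features
    for token in tokens[1:]:
        index, value = token.split(':')
        dense_row[int(index) - 1] = float(value)
    return label, dense_row


# Hypothetical, shortened record: features 3, 4 and 6 are not stored.
label, row = parse_svmlight_line('1 1:123 2:1.316440 5:0.25', n_features=6)
print(label)  # 1.0
print(row)    # [123.0, 1.31644, 0.0, 0.0, 0.25, 0.0]
```

A dense row like this is only practical for a handful of columns; with thousands of mostly empty columns, a sparse representation is the more economical choice.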

## Description of Attributes

The following sections describe this dataset using the Readme.txt file, examination of the data, and definition of the terms.

In [2]:
# We are using a Pandas dataframe to tabulate the data (and provide a simple introduction to Pandas)

#Dimension Index  ... first column

df_attributes = pd.DataFrame(
  data=[
    ('Dimension Index','0','integer',''),
    ('Shot Length','1','integer',''),
    ('Motion Distribution','2-3','float','Mean and Variance'),
    ('Frame Difference Distribution','4-5','float','Mean and Variance'),
    ('Short time energy','6-7','float','Mean and Variance'),
    ('ZCR','8-9','float','Mean and Variance'),
    ('Spectral Centroid','10-11','float','Mean and Variance'),
    ('Spectral Roll off','12-13','float','Mean and Variance'),
    ('Spectral Flux','14-15','float','Mean and Variance'),
    ('Fundamental Frequency','16-17','float','Mean and Variance'),
    ('Motion Distribution','18-58','float','40 bins'),
    ('Frame Difference Distribution','59-91','float','32 bins'),
    ('Text area distribution','92-122','float','15 bins Mean and 15 bins for variance'),
    ('Bag of Audio Words','123-4123','float','4,000 bins'), 
    ('Edge change Ratio','4124-4125','float','Mean and Variance')
  ],
  columns=[
    'Attribute Name','Columns','Datatype','Notes'
  ],
  index=[
    'Attribute 00', 'Attribute 01', 'Attribute 02', 'Attribute 03', 'Attribute 04', 'Attribute 05', 'Attribute 06',
    'Attribute 07', 'Attribute 08', 'Attribute 09', 'Attribute 10', 'Attribute 11', 'Attribute 12', 'Attribute 13',
    'Attribute 14'
  ]
)

# we will later omit the Bag of Audio Words attribute, "123-4123", to reduce the sparsity of the data.
# tabulate is used to left justify these string value columns (versus the right-justified default)

#from tabulate import tabulate

#print(tabulate(df_attributes, showindex=True, headers=df_attributes.columns))
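
As a quick sanity check on the table above, the width of each column range can be compared with the number of bins stated in the Notes column. The sketch below copies the ranges and bin counts from `df_attributes` by hand; the helper name `range_width` is hypothetical.

```python
# Column ranges and bin counts copied from the attribute table.
binned_attributes = {
    'Motion Distribution': ('18-58', 40),
    'Frame Difference Distribution': ('59-91', 32),
    'Text area distribution': ('92-122', 30),   # 15 mean bins + 15 variance bins
    'Bag of Audio Words': ('123-4123', 4000),
}


def range_width(column_range):
    """Number of columns covered by an inclusive 'first-last' range string."""
    first, last = (int(part) for part in column_range.split('-'))
    return last - first + 1


for name, (column_range, bins) in binned_attributes.items():
    width = range_width(column_range)
    print(name, '| columns:', width, '| bins:', bins, '| extra:', width - bins)
```

Each listed range turns out to be one column wider than its stated bin count, which is worth keeping in mind when naming individual columns.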

### Attribute Descriptions

### Shot Length 
Commercial video shots are usually short, with fast visual transitions and a peculiar placement of overlaid text bands. Video Shot Length is used directly as one of the features.

### Short time energy
Short-time energy can be used to classify speech as voiced, unvoiced, or silent. It is derived from the total-energy relation defined in signal processing: the total energy of a discrete-time signal is the sum of its squared sample values, and short-time energy applies that same sum to a short window of the signal.
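
A minimal sketch of that idea follows. It assumes non-overlapping rectangular frames and a made-up sample signal; the dataset's authors may have used a different window shape or overlap.

```python
def short_time_energy(samples, frame_length):
    """Return the energy (sum of squared samples) of each non-overlapping frame.

    Assumption: rectangular window, no overlap; trailing samples that do
    not fill a whole frame are ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        energies.append(sum(x * x for x in frame))
    return energies


# Hypothetical signal: a quiet frame followed by a loud frame.
signal = [0.0, 0.0, 0.1, -0.1, 1.0, -1.0, 0.5, -0.5]
print(short_time_energy(signal, frame_length=4))
```

The second frame's energy is far larger than the first, which is the kind of contrast that separates loud passages from near silence.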

### ZCR
Zero Crossing Rate (ZCR) is the rate of sign changes along a signal. It is used in both speech recognition and music information retrieval as a feature for classifying sounds. That is precisely its use case in this dataset: it will be used as one of the attributes that help differentiate commercials from the news program.
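
The definition can be sketched in a few lines. Treating a sample of exactly zero as non-negative is an assumption made here for simplicity, and the function name is illustrative.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to zero counts as non-negative.
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1
        for previous, current in zip(samples, samples[1:])
        if (previous >= 0) != (current >= 0)
    )
    return crossings / (len(samples) - 1)


print(zero_crossing_rate([1, -1, 1, -1, 1]))  # 1.0 (every pair changes sign)
print(zero_crossing_rate([1, 2, 3, 2, 1]))    # 0.0 (no sign changes)
```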

### Spectral Centroid
Spectral Centroid is a measure of the “center of gravity” of a spectrum, computed from the frequency and magnitude information of the Fourier transform. It is commonly used in digital signal processing to help characterise a spectrum.
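
In other words, the centroid is a magnitude-weighted mean of the frequencies. The sketch below assumes the spectrum has already been computed and is passed in as two parallel lists; the values are made up.

```python
def spectral_centroid(frequencies, magnitudes):
    """Magnitude-weighted mean frequency of a spectrum."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    return sum(f * m for f, m in zip(frequencies, magnitudes)) / total


# Hypothetical three-bin spectrum with most of its weight at 300 Hz.
print(spectral_centroid([100.0, 200.0, 300.0], [1.0, 1.0, 2.0]))  # 225.0
```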

### Spectral Roll off
Spectral Rolloff Point is a measure of the amount of right-skewness of the power spectrum.
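
One common way to compute a rolloff point is to find the frequency below which a fixed fraction of the spectral power lies. The sketch below uses that definition with a fraction of 0.85; both the definition and the fraction are assumptions, since the readme does not say how the dataset's values were computed.

```python
def spectral_rolloff(frequencies, magnitudes, fraction=0.85):
    """Lowest frequency at which the cumulative power reaches `fraction` of the total.

    Assumption: power is the squared magnitude and 0.85 is the cut-off.
    """
    power = [m * m for m in magnitudes]
    threshold = fraction * sum(power)
    cumulative = 0.0
    for frequency, p in zip(frequencies, power):
        cumulative += p
        if cumulative >= threshold:
            return frequency
    return frequencies[-1] if frequencies else 0.0


# Hypothetical spectrum: power is [4, 1, 1, 0], so 85% is reached at 300 Hz.
print(spectral_rolloff([100.0, 200.0, 300.0, 400.0], [2.0, 1.0, 1.0, 0.0]))  # 300.0
```

A spectrum with a long high-frequency tail pushes the rolloff point upward, which is how it reflects skew.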

### Spectral Flux
Spectral flux is a measure of how quickly the power spectrum of a signal is changing. It is calculated by comparing the power spectrum for one frame against the power spectrum from the previous frame.
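
A minimal sketch of that frame-to-frame comparison is shown below. It uses the sum of squared differences between the two spectra, which is one common choice; whether the dataset's authors normalized the spectra first is not stated, so that detail is left out.

```python
def spectral_flux(previous_power, current_power):
    """Sum of squared differences between two consecutive power spectra."""
    return sum((c - p) ** 2 for p, c in zip(previous_power, current_power))


# Identical spectra give zero flux; a sudden change gives a large value.
print(spectral_flux([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(spectral_flux([0.0, 0.0], [3.0, 4.0]))            # 25.0
```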

### Fundamental Frequency
The fundamental frequency is the lowest frequency of a periodic waveform. In music, the fundamental is the musical pitch of a note that is perceived as the lowest partial present.

### Motion Distribution
Motion Distribution is obtained by first computing dense optical flow (Horn-Schunck formulation), followed by construction of a distribution of flow magnitudes over the entire shot with 40 uniformly divided bins in the range [0, 40].
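
The binning step can be sketched as a plain histogram. The optical flow computation itself is not shown; the flow magnitudes below are made up, and normalizing the counts to fractions and clamping values at or above the upper limit into the last bin are assumptions.

```python
def magnitude_histogram(magnitudes, n_bins=40, upper=40.0):
    """Fraction of flow magnitudes falling in each of n_bins uniform bins on [0, upper].

    Assumptions: counts are normalized to sum to 1, and magnitudes at or
    above `upper` are clamped into the last bin.
    """
    counts = [0] * n_bins
    bin_width = upper / n_bins
    for magnitude in magnitudes:
        index = min(int(magnitude / bin_width), n_bins - 1)
        counts[index] += 1
    total = len(magnitudes)
    if total == 0:
        return [0.0] * n_bins
    return [count / total for count in counts]


# Hypothetical flow magnitudes for one shot.
histogram = magnitude_histogram([0.5, 0.7, 1.2, 39.9, 45.0])
print(histogram[0], histogram[1], histogram[39])  # 0.4 0.2 0.4
```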

### Frame Difference Distribution
The Frame Difference Distribution measures the difference between the current frame and a reference frame, often called the "background image" or "background model". This helps measure the perceived speed at which the frames change. Sudden changes in pixel intensities are captured by the Frame Difference Distribution; such changes are not registered by optical flow, so the Frame Difference Distribution is computed alongside the flow magnitude distributions. The researchers obtain the frame difference by averaging the absolute frame difference in each of the 3 color channels, and the distribution is constructed with 32 bins in the range [0, 255].
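
The per-pixel step can be sketched as follows. Frames are represented here as flat lists of (r, g, b) tuples with made-up values, and the two helper names are hypothetical; real frames would normally be image arrays.

```python
def mean_absolute_frame_difference(frame_a, frame_b):
    """Per-pixel absolute difference, averaged over the three color channels."""
    return [
        sum(abs(a - b) for a, b in zip(pixel_a, pixel_b)) / 3.0
        for pixel_a, pixel_b in zip(frame_a, frame_b)
    ]


def difference_bin(difference, n_bins=32, upper=255.0):
    """Index of the uniform bin on [0, upper] that a difference value falls into."""
    return min(int(difference / (upper / n_bins)), n_bins - 1)


# Two hypothetical 2-pixel frames: the first pixel is unchanged, the second changes.
previous_frame = [(0, 0, 0), (10, 20, 30)]
current_frame = [(0, 0, 0), (40, 20, 0)]
differences = mean_absolute_frame_difference(previous_frame, current_frame)
print(differences)                               # [0.0, 20.0]
print([difference_bin(d) for d in differences])  # [0, 2]
```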

### Text area distribution
The Text Area Distribution is similar to the Frame Difference Distribution in that it measures the difference between the text currently on screen and a reference amount of text. The text distribution feature is obtained by averaging the fraction of text area present in a grid block over all frames of the shot.
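
A minimal sketch of the per-block averaging is shown below, together with the per-block variance that the attribute table lists. The input layout (one list of text-area fractions per frame) and the use of population variance are assumptions, and the two-block example is made up; the table lists 15 blocks.

```python
from statistics import mean, pvariance


def text_area_distribution(frames):
    """Per-block mean and variance of the text-area fraction over all frames.

    `frames` is a list of per-frame lists, each holding one text-area
    fraction per grid block. Assumption: population variance is used.
    """
    blocks = list(zip(*frames))  # regroup values by grid block
    means = [mean(block) for block in blocks]
    variances = [pvariance(block) for block in blocks]
    return means, variances


# Hypothetical shot with two frames and two grid blocks.
means, variances = text_area_distribution([[0.0, 0.5], [0.2, 0.5]])
print(means)
print(variances)
```

A block whose text coverage never changes, such as a permanent ticker band, shows a variance of zero.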

### Bag of Audio Words
This attribute is to be removed to reduce the sparseness of the dataset.

### Edge change Ratio
Edge Change Ratio captures the motion of edges between consecutive frames and is defined as the ratio of displaced edge pixels to the total number of edge pixels in a frame. The researchers calculated the mean and variance of the ECR over the entire shot.

In [3]:
#print(df_temp1[0].column)

#pd.set_option('display.max_row', 1000)
#pd.set_option('display.max_columns', 150)

# Build one mapping from feature column index to a descriptive column name
column_names = {0: 'Dimension Index', 1: 'Shot'}

# Columns 2-17: a mean/variance pair for each of eight attributes
mean_variance_attributes = [
    'Motion Distribution', 'Frame Difference Distribution', 'Short time energy', 'ZCR',
    'Spectral Centroid', 'Spectral Roll off', 'Spectral Flux', 'Fundamental Frequency'
]

for i, name in enumerate(mean_variance_attributes):
    column_names[2 + 2 * i] = name + '-Mean'
    column_names[3 + 2 * i] = name + '-Variance'

# Columns 18-57: Motion Distribution bins 1-40
for bin_number in range(1, 41):
    column_names[17 + bin_number] = 'Motion Distribution-Bin %d' % bin_number

# NOTE: Attribute 58 should be Bin 40 ... don't know what's wrong (other than readme.txt)
column_names[58] = 'Attribute 58 should be Bin 40'

# Columns 59-90: Frame Difference Distribution bins 1-32
for bin_number in range(1, 33):
    column_names[58 + bin_number] = 'Frame Difference Distribution-Bin %d' % bin_number

# NOTE: Attribute 91 should be Bin 32 ... don't know what's wrong (other than readme.txt)
column_names[91] = 'Attribute 91 should be Bin 32'

# Columns 92-106: Text area distribution means; columns 107-121: variances
for bin_number in range(1, 16):
    column_names[91 + bin_number] = 'Text area distribution-Bin %d-Mean' % bin_number
    column_names[106 + bin_number] = 'Text area distribution-Bin %d-Variance' % bin_number

# NOTE: Attribute 122 should be Bin 15-Variance ... don't know what's wrong (other than readme.txt)
column_names[122] = 'Attribute 122 should be Bin 15-Variance'

# NOTE: df_attributes is labeled by 'Attribute Name', 'Columns', 'Datatype', and 'Notes',
# so none of these integer keys match and this rename leaves df_attributes unchanged
df_attributes.rename(columns=column_names, inplace=True)

for index, row in df_attributes.iterrows():
    print(row)

Attribute Name    Dimension Index
Columns                         0
Datatype                  integer
Notes                            
Name: Attribute 00, dtype: object
Attribute Name    Shot Length
Columns                     1
Datatype              integer
Notes                        
Name: Attribute 01, dtype: object
Attribute Name    Motion Distribution
Columns                           2-3
Datatype                        float
Notes               Mean and Variance
Name: Attribute 02, dtype: object
Attribute Name    Frame Difference Distribution
Columns                                     4-5
Datatype                                  float
Notes                         Mean and Variance
Name: Attribute 03, dtype: object
Attribute Name    Short time energy
Columns                         6-7
Datatype                      float
Notes             Mean and Variance
Name: Attribute 04, dtype: object
Attribute Name                  ZCR
Columns                         8-9
Datatype      

# Data Preparation

This section covers the activities needed to construct the dataset that will be fed into the models. The files for this project (bbc.txt, cnn.txt, cnnibn.txt, ndtv.txt, and timesnow.txt) can be found at https://archive.ics.uci.edu/ml/datasets/TV+News+Channel+Commercial+Detection+Dataset as a single ZIP file. To eliminate manual work and streamline file processing, these five files were extracted and put on a team member's website (http://www.shookfamily.org) as follows:

http://www.shookfamily.org/data/BBC.txt (17,720 lines)

http://www.shookfamily.org/data/CNN.txt (22,545 lines)

http://www.shookfamily.org/data/CNNIBN.txt (33,117 lines)

http://www.shookfamily.org/data/NDTV.txt (17,051 lines)

http://www.shookfamily.org/data/TIMESNOW.txt (39,252 lines)

As shown in the cells below, it takes several steps to download the files and process them into the final dataset.

The overall goal is to download the files from the internet and load them into an in-memory object. Because these files are stored in the SVM Light format, they are first loaded into a scipy.sparse matrix array object. These sparse matrix arrays are then inspected to eliminate as many columns as possible, and, consequently, reduce the sparseness of the matrix. Once that is accomplished, the scipy.sparse matrix arrays are converted to Pandas DataFrames for faster data processing and input into the accompanying data models.
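
The effect of dropping columns on sparseness can be illustrated with a toy example. The sketch below stores each row as an `{index: value}` dictionary of its non-zero entries and measures density as the fraction of cells that are actually stored. The rows, the column range, and both helper names are made up; the notebook itself works with scipy.sparse matrices rather than dictionaries.

```python
def density(rows, n_columns):
    """Fraction of cells that hold a stored (non-zero) value."""
    stored = sum(len(row) for row in rows)
    return stored / (len(rows) * n_columns)


def drop_column_range(rows, first, last):
    """Return copies of the rows without the columns in [first, last]."""
    return [
        {index: value for index, value in row.items() if not first <= index <= last}
        for row in rows
    ]


# Hypothetical data: 20 columns, of which columns 4-20 are almost always empty.
rows = [
    {1: 123.0, 2: 1.3, 3: 1.5, 10: 0.002},
    {1: 80.0, 2: 0.9, 3: 1.1, 15: 0.004},
]
print(density(rows, n_columns=20))    # 0.2

trimmed = drop_column_range(rows, first=4, last=20)
print(density(trimmed, n_columns=3))  # 1.0
```

Removing a wide block of rarely populated columns raises the density sharply, which is the motivation for inspecting the matrices before converting them.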


## Step 1: Download Files

The first step in this process is to download the five files from the internet. The data is stored as plain text in the SVM Light format (the download code decodes it as UTF-8 text; despite the name of the get_pickled_file helper, no pickling is involved). The SVM Light format is basically a list of index:value pairs, where the index identifies an element in a sparse matrix array and the value is the number stored in that element. For example, a partial record like the following:

> 1 1:123 2:1.316440 3:1.516003 ...

represents the Y-axis label followed by the X-axis values, where the first, second, and third elements of the sparse matrix array hold the values 123, 1.316440, and 1.516003 (that is, array[0] == 123, array[1] == 1.316440, and array[2] == 1.516003). The code below downloads each SVM Light file from the internet and loads it as a scipy.sparse matrix object.

Note: It takes about 30 to 60 seconds to perform all five downloads.

In [4]:
import urllib.request
import tempfile

from sklearn.datasets import load_svmlight_file

url_bbc      = 'http://www.shookfamily.org/data/BBC.txt'
url_cnn      = 'http://www.shookfamily.org/data/CNN.txt'
url_cnnibn   = 'http://www.shookfamily.org/data/CNNIBN.txt'
url_ndtv     = 'http://www.shookfamily.org/data/NDTV.txt'
url_timesnow = 'http://www.shookfamily.org/data/TIMESNOW.txt'

################################################################################
# Download file to a temporary file. Load that file into a scipy.sparse matrix
# array, and then return that object to the caller.
################################################################################

def get_pickled_file(url):
    response = urllib.request.urlopen(url)
    data = response.read()      # a `bytes` object
    text = data.decode('utf-8') # a `str`; this step can't be used if data is binary

    with tempfile.NamedTemporaryFile(delete=False, mode='w') as file_handle:
        file_handle.write(text)
        filename = file_handle.name

    # Load the file only after the with block has closed it, so that all
    # buffered text has been flushed to disk before it is read back
    data = load_svmlight_file(filename)

    return data[0],data[1]   # data[0] == X axis and data[1] == Y axis

################################################################################
# Download files as scipy.sparse matrix arrays
################################################################################

print('Downloading datasets from the internet ...\n')
print('Downloading (as scipy.sparse matrix) ...', url_bbc)

%time sm_bbc = get_pickled_file(url_bbc)

print('Downloading (as scipy.sparse matrix) ...', url_cnn)

%time sm_cnn = get_pickled_file(url_cnn)

print('Downloading (as scipy.sparse matrix) ...', url_cnnibn)

%time sm_cnnibn = get_pickled_file(url_cnnibn)

print('Downloading (as scipy.sparse matrix) ...', url_ndtv)

%time sm_ndtv = get_pickled_file(url_ndtv)

print('Downloading (as scipy.sparse matrix) ...', url_timesnow)

%time sm_timesnow = get_pickled_file(url_timesnow)

print('\nAll files have been downloaded')

Downloading datasets from the internet ...

Downloading (as scipy.sparse matrix) ... http://www.shookfamily.org/data/BBC.txt
Wall time: 3.71 s
Downloading (as scipy.sparse matrix) ... http://www.shookfamily.org/data/CNN.txt
Wall time: 5.88 s
Downloading (as scipy.sparse matrix) ... http://www.shookfamily.org/data/CNNIBN.txt
Wall time: 8.6 s
Downloading (as scipy.sparse matrix) ... http://www.shookfamily.org/data/NDTV.txt
Wall time: 4.45 s
Downloading (as scipy.sparse matrix) ... http://www.shookfamily.org/data/TIMESNOW.txt
Wall time: 10.1 s

All files have been downloaded


## Step 2: Convert X-axis to Pandas Sparse Dataframes

This step converts the scipy.sparse matrix arrays (X axis) into Pandas Sparse Dataframes (to get them into the Pandas ecosystem).

Note: It takes about 5 minutes to perform these conversions (~1 minute per conversion).

In [5]:
# This code creates the Pandas Sparse Dataframes from the X axis

print('Converting scipy.sparse matrix arrays to Pandas Sparse Dataframes ...\n')
print('Converting sm_bbc[0] to sdf_bbcX ...')

%time sdf_bbcX = pd.SparseDataFrame(sm_bbc[0])

print('Converting sm_cnn[0] to sdf_cnnX ...')

%time sdf_cnnX = pd.SparseDataFrame(sm_cnn[0])

print('Converting sm_cnnibn[0] to sdf_cnnibnX ...')

%time sdf_cnnibnX = pd.SparseDataFrame(sm_cnnibn[0])

print('Converting sm_ndtv[0] to sdf_ndtvX ...')

%time sdf_ndtvX = pd.SparseDataFrame(sm_ndtv[0])

print('Converting sm_timesnow[0] to sdf_timesnowX ...')

%time sdf_timesnowX = pd.SparseDataFrame(sm_timesnow[0])

print('\nAll scipy.sparse matrix arrays have been converted to Pandas Sparse Dataframes\n')
print(sdf_bbcX.info())
print(sdf_cnnX.info())
print(sdf_cnnibnX.info())
print(sdf_ndtvX.info())
print(sdf_timesnowX.info())

print('\nDONE')

Converting scipy.sparse matrix arrays to Pandas Sparse Dataframes ...

Converting sm_bbc[0] to sdf_bbcX ...
Wall time: 639 ms
Converting sm_cnn[0] to sdf_cnnX ...
Wall time: 724 ms
Converting sm_cnnibn[0] to sdf_cnnibnX ...
Wall time: 886 ms
Converting sm_ndtv[0] to sdf_ndtvX ...
Wall time: 679 ms
Converting sm_timesnow[0] to sdf_timesnowX ...
Wall time: 1.03 s

All scipy.sparse matrix arrays have been converted to Pandas Sparse Dataframes

<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Columns: 4125 entries, 0 to 4124
dtypes: float64(4125)
memory usage: 13.8 MB
None
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 22545 entries, 0 to 22544
Columns: 4125 entries, 0 to 4124
dtypes: float64(4125)
memory usage: 22.1 MB
None
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 33117 entries, 0 to 33116
Columns: 4125 entries, 0 to 4124
dtypes: float64(4125)
memory usage: 32.0 MB
None
<class 'pandas.core.sparse.frame.SparseDataFr

## Step 3: Convert the Y-axis to Pandas Sparse DataFrames

This step converts the label array (Y axis) that accompanies each scipy.sparse matrix into a Pandas Sparse DataFrame (to get it into the pandas ecosystem).

Note: It takes just a few seconds to perform these conversions.
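
Because the Y axis is a single dense vector of labels, a named pandas Series is a lighter-weight alternative to a one-column sparse frame. The sketch below assumes a tiny hypothetical label vector `y` in place of `sm_bbc[1]`; the name `label` is an illustrative choice, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Hypothetical label vector standing in for sm_bbc[1]
y = np.array([1.0, 1.0, -1.0, 1.0, -1.0])

# A named Series keeps the labels dense and gives the column a readable name
labels = pd.Series(y, name="label").astype("int8")

print(labels.head())
print(labels.dtype)
```

Giving the Series a name also avoids a second column labelled 0 once it is joined to the features.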

In [6]:
print('Converting scipy.sparse matrix array to Pandas Sparse Dataframe ...')

print('\nConverting sm_bbc[1] to sdf_bbcY ...')
%time sdf_bbcY = pd.SparseDataFrame(sm_bbc[1])
print(sdf_bbcY.info())
print(sdf_bbcY.head())

print('\nConverting sm_cnn[1] to sdf_cnnY ...')
%time sdf_cnnY = pd.SparseDataFrame(sm_cnn[1])
print(sdf_cnnY.info())
print(sdf_cnnY.head())

print('\nConverting sm_cnnibn[1] to sdf_cnnibnY ...')
%time sdf_cnnibnY = pd.SparseDataFrame(sm_cnnibn[1])
print(sdf_cnnibnY.info())
print(sdf_cnnibnY.head())

print('\nConverting sm_ndtv[1] to sdf_ndtvY ...')
%time sdf_ndtvY = pd.SparseDataFrame(sm_ndtv[1])
print(sdf_ndtvY.info())
print(sdf_ndtvY.head())

print('\nConverting sm_timesnow[1] to sdf_timesnowY ...')
%time sdf_timesnowY = pd.SparseDataFrame(sm_timesnow[1])
print(sdf_timesnowY.info())
print(sdf_timesnowY.head())

print('\nDONE')

Converting scipy.sparse matrix array to Pandas Sparse Dataframe ...

Converting sm_bbc[1] to sdf_bbcY ...
Wall time: 1e+03 µs
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Data columns (total 1 columns):
0    17720 non-null float64
dtypes: float64(1)
memory usage: 138.5 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0

Converting sm_cnn[1] to sdf_cnnY ...
Wall time: 0 ns
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 22545 entries, 0 to 22544
Data columns (total 1 columns):
0    22545 non-null float64
dtypes: float64(1)
memory usage: 176.2 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0

Converting sm_cnnibn[1] to sdf_cnnibnY ...
Wall time: 2 ms
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 33117 entries, 0 to 33116
Data columns (total 1 columns):
0    33117 non-null float64
dtypes: float64(1)
memory usage: 258.8 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0

Converting sm_ndtv[1] to sdf_ndtvY ...
Wall t

## Convert to a Fixed DataFrame

In [7]:
df_bbcX      = pd.DataFrame(sdf_bbcX)
df_cnnX      = pd.DataFrame(sdf_cnnX)
df_cnnibnX   = pd.DataFrame(sdf_cnnibnX)
df_ndtvX     = pd.DataFrame(sdf_ndtvX)
df_timesnowX = pd.DataFrame(sdf_timesnowX)

print(df_bbcX.info())
print(df_bbcX.head())
print('\n')
print(df_cnnX.info())
print(df_cnnX.head())
print('\n')
print(df_cnnibnX.info())
print(df_cnnibnX.head())
print('\n')
print(df_ndtvX.info())
print(df_ndtvX.head())
print('\n')
print(df_timesnowX.info())
print(df_timesnowX.head())
print('\n')

print('\nDONE')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Columns: 4125 entries, 0 to 4124
dtypes: float64(4125)
memory usage: 13.8 MB
None
    0         1         2          3         4         5         6     \
0  123.0  1.316440  1.516003   5.605905  5.346760  0.013233  0.010729   
1  124.0  0.966079  0.546420   4.046537  3.190973  0.008338  0.011490   
2  109.0  2.035407  0.571643   9.551406  5.803685  0.015189  0.014294   
3   86.0  3.206008  0.786326  10.092709  2.693058  0.013962  0.011039   
4   76.0  3.135861  0.896346  10.348035  2.651010  0.020914  0.012061   

       7         8            9       ...     4115  4116  4117  4118  4119  \
0  0.091743  0.050768  3808.067871    ...      NaN   NaN   NaN   NaN   NaN   
1  0.075504  0.065841  3466.266113    ...      NaN   NaN   NaN   NaN   NaN   
2  0.094209  0.044991  3798.196533    ...      NaN   NaN   NaN   NaN   NaN   
3  0.092042  0.043756  3761.712402    ...      NaN   NaN   NaN   NaN   NaN   
4  0.108018  

In [8]:
df_bbcY      = pd.DataFrame(sdf_bbcY)
df_cnnY      = pd.DataFrame(sdf_cnnY)
df_cnnibnY   = pd.DataFrame(sdf_cnnibnY)
df_ndtvY     = pd.DataFrame(sdf_ndtvY)
df_timesnowY = pd.DataFrame(sdf_timesnowY)

print(df_bbcY.info())
print(df_bbcY.head())
print('\n')
print(df_cnnY.info())
print(df_cnnY.head())
print('\n')
print(df_cnnibnY.info())
print(df_cnnibnY.head())
print('\n')
print(df_ndtvY.info())
print(df_ndtvY.head())
print('\n')
print(df_timesnowY.info())
print(df_timesnowY.head())
print('\n')

print('\nDONE')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Data columns (total 1 columns):
0    17720 non-null float64
dtypes: float64(1)
memory usage: 138.5 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22545 entries, 0 to 22544
Data columns (total 1 columns):
0    22545 non-null float64
dtypes: float64(1)
memory usage: 176.2 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33117 entries, 0 to 33116
Data columns (total 1 columns):
0    33117 non-null float64
dtypes: float64(1)
memory usage: 258.8 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17051 entries, 0 to 17050
Data columns (total 1 columns):
0    17051 non-null float64
dtypes: float64(1)
memory usage: 133.3 KB
None
     0
0  1.0
1  1.0
2  1.0
3  1.0
4  1.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39252 entries, 0 to 39251
Data columns (to

In [9]:
%time df_bbc      = pd.concat([df_bbcX, df_bbcY], axis=1)
%time df_cnn      = pd.concat([df_cnnX, df_cnnY], axis=1)
%time df_cnnibn   = pd.concat([df_cnnibnX, df_cnnibnY], axis=1)
%time df_ndtv     = pd.concat([df_ndtvX, df_ndtvY], axis=1)
%time df_timesnow = pd.concat([df_timesnowX, df_timesnowY], axis=1)

print(df_bbc.info())
print(df_bbc.head())
print('\n')
print(df_cnn.info())
print(df_cnn.head())
print('\n')
print(df_cnnibn.info())
print(df_cnnibn.head())
print('\n')
print(df_ndtv.info())
print(df_ndtv.head())
print('\n')
print(df_timesnow.info())
print(df_timesnow.head())
print('\n')

print('\nDONE')

Wall time: 148 ms
Wall time: 139 ms
Wall time: 149 ms
Wall time: 131 ms
Wall time: 273 ms
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 14.0 MB
None
   0         1         2          3         4         5         6     \
0   1.0  1.316440  1.516003   5.605905  5.346760  0.013233  0.010729   
1   1.0  0.966079  0.546420   4.046537  3.190973  0.008338  0.011490   
2   1.0  2.035407  0.571643   9.551406  5.803685  0.015189  0.014294   
3   1.0  3.206008  0.786326  10.092709  2.693058  0.013962  0.011039   
4   1.0  3.135861  0.896346  10.348035  2.651010  0.020914  0.012061   

       7         8            9     ...   4116  4117  4118  4119  4120  4121  \
0  0.091743  0.050768  3808.067871  ...    NaN   NaN   NaN   NaN   NaN   NaN   
1  0.075504  0.065841  3466.266113  ...    NaN   NaN   NaN   NaN   NaN   NaN   
2  0.094209  0.044991  3798.196533  ...    NaN   NaN   NaN   NaN   Na

## Step 4: Merge the Pandas Sparse Dataframes

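This step appends the Y-axis to the X-axis by concatenating the two Pandas Sparse DataFrames along the column axis. With `pd.concat([X, Y], axis=1)`, the dependent variable (Dimension Index) becomes the last column of the dataframe, after the independent variables. Because both frames use integer column labels starting at 0, the appended column is also labelled 0, which is why the `info()` output reports the columns as running from "0 to 0".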

Note: This step takes about 5 seconds (~1 second for each concatenation).
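
One way to avoid ending up with two columns labelled 0 is to name the label column before concatenating. This is a minimal sketch on made-up data: the frames `X` and `y` and the column name `label` are assumptions for illustration, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical features with default integer column labels 0 and 1
X = pd.DataFrame(np.arange(6.0).reshape(3, 2))

# Hypothetical labels, given an explicit name so they cannot collide with column 0
y = pd.Series([1.0, -1.0, 1.0], name="label")

# axis=1 places the label as the last column
merged = pd.concat([X, y], axis=1)

print(merged.columns.tolist())  # [0, 1, 'label']
print(merged.columns.is_unique)
```

With unique labels, `merged[0]` returns a single Series rather than a two-column frame.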

In [10]:
%time sdf_concat_bbc      = pd.concat([sdf_bbcX, sdf_bbcY], axis=1)
%time sdf_concat_cnn      = pd.concat([sdf_cnnX, sdf_cnnY], axis=1)
%time sdf_concat_cnnibn   = pd.concat([sdf_cnnibnX, sdf_cnnibnY], axis=1)
%time sdf_concat_ndtv     = pd.concat([sdf_ndtvX, sdf_ndtvY], axis=1)
%time sdf_concat_timesnow = pd.concat([sdf_timesnowX, sdf_timesnowY], axis=1)

print('\nDONE')

Wall time: 125 ms
Wall time: 309 ms
Wall time: 133 ms
Wall time: 127 ms
Wall time: 143 ms

DONE


In [11]:
sdf_concat_bbc.info()
sdf_concat_cnn.info()
sdf_concat_cnnibn.info()
sdf_concat_ndtv.info()
sdf_concat_timesnow.info()

print('\nDONE')

<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 14.0 MB
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 22545 entries, 0 to 22544
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 22.3 MB
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 33117 entries, 0 to 33116
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 32.2 MB
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 17051 entries, 0 to 17050
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 16.5 MB
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 39252 entries, 0 to 39251
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 38.4 MB

DONE


## Step 5: Concatenate into a Single Pandas Sparse DataFrame

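Now that all five datasets have been reduced to fixed matrices, this step concatenates them into a single Pandas DataFrame, so that the rest of this notebook operates on one frame instead of five. In the run recorded below, only the BBC and CNN frames are concatenated (the five-frame list is commented out), which yields 40,265 rows.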

Note: This code executes in about 10 seconds.
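
Stacking frames with a plain `pd.concat(frames)` keeps each frame's own row numbers, so the combined index repeats (the recorded output reports an `Int64Index` running "0 to 22544" for 40,265 rows). The sketch below, on small made-up frames, shows one way to get a fresh index and record which channel each row came from. The `channel` column and the frame names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical per-channel frames standing in for df_concat_bbc, df_concat_cnn, ...
df_a = pd.DataFrame({"f1": [0.1, 0.2], "label": [1, -1]})
df_b = pd.DataFrame({"f1": [0.3, 0.4, 0.5], "label": [1, 1, -1]})

frames = {"bbc": df_a, "cnn": df_b}

# Tag each row with its source channel, then stack with a new 0..n-1 index
df_all = pd.concat(
    [df.assign(channel=name) for name, df in frames.items()],
    ignore_index=True,
)

print(df_all)
print(df_all.index.is_unique)
```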

In [12]:
# Convert to fixed matrix

df_concat_bbc      = pd.DataFrame(sdf_concat_bbc)
df_concat_cnn      = pd.DataFrame(sdf_concat_cnn)
df_concat_cnnibn   = pd.DataFrame(sdf_concat_cnnibn)
df_concat_ndtv     = pd.DataFrame(sdf_concat_ndtv)
df_concat_timesnow = pd.DataFrame(sdf_concat_timesnow)

df_concat_bbc.info()
df_concat_cnn.info()
df_concat_cnnibn.info()
df_concat_ndtv.info()
df_concat_timesnow.info()

print('\nDONE')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 14.0 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22545 entries, 0 to 22544
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 22.3 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33117 entries, 0 to 33116
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 32.2 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17051 entries, 0 to 17050
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 16.5 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39252 entries, 0 to 39251
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 38.4 MB

DONE


In [13]:
df_concat_bbc.head()

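|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 4116 | 4117 | 4118 | 4119 | 4120 | 4121 | 4122 | 4123 | 4124 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|-----|------|------|------|------|------|------|------|------|------|---|
| 0 | 123.0 | 1.31644 | 1.516003 | 5.605905 | 5.34676 | 0.013233 | 0.010729 | 0.091743 | 0.050768 | 3808.067871 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.422334 | 0.663918 | 1.0 |
| 1 | 124.0 | 0.966079 | 0.54642 | 4.046537 | 3.190973 | 0.008338 | 0.01149 | 0.075504 | 0.065841 | 3466.266113 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.332664 | 0.766184 | 1.0 |
| 2 | 109.0 | 2.035407 | 0.571643 | 9.551406 | 5.803685 | 0.015189 | 0.014294 | 0.094209 | 0.044991 | 3798.196533 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.346674 | 0.225022 | 1.0 |
| 3 | 86.0 | 3.206008 | 0.786326 | 10.092709 | 2.693058 | 0.013962 | 0.011039 | 0.092042 | 0.043756 | 3761.712402 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.993323 | 0.840083 | 1.0 |
| 4 | 76.0 | 3.135861 | 0.896346 | 10.348035 | 2.65101 | 0.020914 | 0.012061 | 0.108018 | 0.052617 | 3784.488037 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.34152 | 0.71047 | 1.0 |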


In [14]:
#frames = [df_concat_bbc, df_concat_cnn, df_concat_cnnibn, df_concat_ndtv, df_concat_timesnow]
frames = [df_concat_bbc, df_concat_cnn]

%time df_concat = pd.concat(frames)

print('\n')
df_concat.info()

print('\nDONE')

Wall time: 2.86 s


<class 'pandas.core.sparse.frame.SparseDataFrame'>
Int64Index: 40265 entries, 0 to 22544
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 36.5 MB

DONE


In [15]:
len(df_concat.index)

40265

In [16]:
#df_concat.head()

## Step 6: Reduce Sparseness of the X-axis (First Pass)

This matrix is sparse primarily because of the following four features:

> Motion Distribution (columns 18 - 58)

> Frame Difference Distribution (columns 59 - 91)

> Text area distribution (columns 92 - 122)

> Bag of Audio Words (columns 123 - 4123)


These columns may be entirely empty. Consequently, this step will inspect every column and remove those that are completely empty (i.e., all column values are NaN (Not a Number)). The assumption is that all columns before Motion Distribution (columns 18 - 58) have data (i.e., they are not sparse). Hence, if any column before column 18 is empty, this step will print that column number for reference. (Others are simply deleted.)

Note: This step takes about 20 minutes to execute (about 4 minutes per sparse matrix object).
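
For reference, pandas can drop all-NaN columns in a single vectorized call rather than a Python loop. The following is a minimal sketch on a tiny made-up frame, not the notebook's data; `df`, `df_reduced`, and `dropped` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "b" is entirely NaN, column "c" only partly
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [np.nan, 5.0, np.nan],
})

cols_before = df.shape[1]

# how="all" removes a column only when every value in it is NaN
df_reduced = df.dropna(axis=1, how="all")

dropped = cols_before - df_reduced.shape[1]
print(df_reduced.columns.tolist())  # ['a', 'c']
print(dropped)                      # 1
```

Whether this is faster on the frames used here has not been measured; it is offered only as an alternative formulation of the same first pass.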

In [17]:
# df_concat was created in In [14]; there is no sdf_concat object,
# so iterate over df_concat directly and print each column.

for i in df_concat:
    print(df_concat[i])
    
print('\nDONE')

In [18]:
################################################################################
# Delete any column that has 100% of its rows as Not a Number (NaN)
################################################################################

def delete_nan_columns(df):
    # Deletes every all-NaN column from df in place and returns how many were deleted
    total_cols_deleted = 0
    
    for i in df:
        total_nan_rows = df[i].isnull().sum()
        if (total_nan_rows == len(df.index)):
            del df[i]
            total_cols_deleted = total_cols_deleted + 1

    return total_cols_deleted

################################################################################
################################################################################

print('\nDeleting all columns that have all values that are 100% NaN (Not a Number) ...\n')

%time total_nan_cols = delete_nan_columns(df_concat)

#%time total_nan_cnn_cols      = delete_nan_columns(sdf_cnnX)
#%time total_nan_cnnibn_cols   = delete_nan_columns(sdf_cnnibnX)
#%time total_nan_ndtv_cols     = delete_nan_columns(sdf_ndtvX)
#%time total_nan_timesnow_cols = delete_nan_columns(sdf_timesnowX)

#print('\nNaN columns have been deleted (out of 4,125)\n')

#print('bbc:      ', total_nan_cols)
#print('cnn:      ', total_nan_cnn_cols)
#print('cnnibn:   ', total_nan_cnnibn_cols)
#print('ndtv:     ', total_nan_ndtv_cols)
#print('timesnow: ', total_nan_timesnow_cols)

print('\nDONE')


Deleting all columns that have all values that are 100% NaN (Not a Number) ...



AttributeError: 'BlockManager' object has no attribute 'T'


DONE


## Step 7: Inspect the Dependent Variable (Dimension Index)

In [19]:
# This code counts the +1 and -1 labels in each channel's Y axis and sums them into a grand total

import numpy as np

tot_y_pos_1_bbc = np.count_nonzero(sm_bbc[1] == 1)
tot_y_neg_1_bbc = np.count_nonzero(sm_bbc[1] == -1)
tot_bbc = tot_y_pos_1_bbc + tot_y_neg_1_bbc

print('sm_bbc - count of +1 and -1 values, and total: ', tot_y_pos_1_bbc, tot_y_neg_1_bbc, tot_bbc)

tot_y_pos_1_cnn = np.count_nonzero(sm_cnn[1] == 1)
tot_y_neg_1_cnn = np.count_nonzero(sm_cnn[1] == -1)
tot_cnn = tot_y_pos_1_cnn + tot_y_neg_1_cnn

print('sm_cnn - count of +1 and -1 values, and total: ', tot_y_pos_1_cnn, tot_y_neg_1_cnn, tot_cnn)

tot_y_pos_1_cnnibn = np.count_nonzero(sm_cnnibn[1] == 1)
tot_y_neg_1_cnnibn = np.count_nonzero(sm_cnnibn[1] == -1)
tot_cnnibn = tot_y_pos_1_cnnibn + tot_y_neg_1_cnnibn

print('sm_cnnibn - count of +1 and -1 values, and total: ', tot_y_pos_1_cnnibn, tot_y_neg_1_cnnibn, tot_cnnibn)

tot_y_pos_1_ndtv = np.count_nonzero(sm_ndtv[1] == 1)
tot_y_neg_1_ndtv = np.count_nonzero(sm_ndtv[1] == -1)
tot_ndtv = tot_y_pos_1_ndtv + tot_y_neg_1_ndtv

print('sm_ndtv - count of +1 and -1 values, and total: ', tot_y_pos_1_ndtv, tot_y_neg_1_ndtv, tot_ndtv)

tot_y_pos_1_timesnow = np.count_nonzero(sm_timesnow[1] == 1)
tot_y_neg_1_timesnow = np.count_nonzero(sm_timesnow[1] == -1)
tot_timesnow = tot_y_pos_1_timesnow + tot_y_neg_1_timesnow

print('sm_timesnow - count of +1 and -1 values, and total: ', tot_y_pos_1_timesnow, tot_y_neg_1_timesnow, tot_timesnow)

grand_total = tot_bbc + tot_cnn + tot_cnnibn + tot_ndtv + tot_timesnow

print('Grand total: ', grand_total)

print('\nDONE')

sm_bbc - count of +1 and -1 values, and total:  8416 9304 17720
sm_cnn - count of +1 and -1 values, and total:  14411 8134 22545
sm_cnnibn - count of +1 and -1 values, and total:  21693 11424 33117
sm_ndtv - count of +1 and -1 values, and total:  12564 4487 17051
sm_timesnow - count of +1 and -1 values, and total:  25147 14105 39252
Grand total:  129685

DONE
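
The per-channel tallies above repeat the same three lines five times. A small helper applied in a loop is one way to express the same count. The arrays below are tiny hypothetical stand-ins for `sm_bbc[1]`, `sm_cnn[1]`, and so on, and `count_labels` is an illustrative name rather than a function from this notebook.

```python
import numpy as np

# Hypothetical label vectors standing in for sm_bbc[1], sm_cnn[1], ...
labels_by_channel = {
    "bbc": np.array([1, -1, -1, 1]),
    "cnn": np.array([1, 1, -1]),
}

def count_labels(y):
    """Return (number of +1 labels, number of -1 labels, their sum)."""
    pos = int(np.count_nonzero(y == 1))
    neg = int(np.count_nonzero(y == -1))
    return pos, neg, pos + neg

counts = {name: count_labels(y) for name, y in labels_by_channel.items()}
grand_total = sum(total for _, _, total in counts.values())

for name, (pos, neg, total) in counts.items():
    print(name, '- count of +1 and -1 values, and total:', pos, neg, total)
print('Grand total:', grand_total)
```

Keying the results by channel name also removes the risk of a mislabelled print statement.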


## Step 8: Convert to Normal Dataframes

Now that thousands of useless columns have been deleted, this step will convert the Sparse DataFrame to a normal DataFrame. This will convert the sparse matrix to a fixed matrix and greatly improve performance in further data analysis.

Note: This cell runs in just a few seconds.
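
On pandas 1.0 and later, where `pd.SparseDataFrame` no longer exists, a sparse-backed frame can be made dense with the `.sparse.to_dense()` accessor. The sketch below assumes a randomly generated stand-in matrix and compares memory use before and after. Note that `from_spmatrix` fills missing entries with 0 rather than NaN, so the dense result here contains zeros.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for one channel's sparse feature matrix
X = sparse.random(200, 100, density=0.02, format="csr", random_state=0)
sdf = pd.DataFrame.sparse.from_spmatrix(X)

# Convert the sparse-backed columns to ordinary float64 columns
df_dense = sdf.sparse.to_dense()

sparse_bytes = sdf.memory_usage(deep=True).sum()
dense_bytes = df_dense.memory_usage(deep=True).sum()

print(df_dense.dtypes.unique())
print('sparse bytes:', sparse_bytes, 'dense bytes:', dense_bytes)
```

The dense frame trades memory for simpler, usually faster column operations, so the conversion is best done after as many empty columns as possible have been removed.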

In [20]:
# Sanity check: the total number of rows should not have changed when deleting columns

print('bbc:      ', len(sdf_concat_bbc.index))
print('cnn:      ', len(sdf_concat_cnn.index))
print('cnnibn:   ', len(sdf_concat_cnnibn.index))
print('ndtv:     ', len(sdf_concat_ndtv.index))
print('timesnow: ', len(sdf_concat_timesnow.index))

print('\nTotal rows: ', len(sdf_concat_bbc.index) + len(sdf_concat_cnn.index) + len(sdf_concat_cnnibn.index) + len(sdf_concat_ndtv.index) + len(sdf_concat_timesnow.index))

# Convert sparse matrix arrays to fixed arrays (now that most of the columns have been deleted)

print('\n')

%time df_bbc      = pd.DataFrame(sdf_concat_bbc)
%time df_cnn      = pd.DataFrame(sdf_concat_cnn)
%time df_cnnibn   = pd.DataFrame(sdf_concat_cnnibn)
%time df_ndtv     = pd.DataFrame(sdf_concat_ndtv)
%time df_timesnow = pd.DataFrame(sdf_concat_timesnow)

print('\nbbc:      ', len(df_bbc.index))
print('cnn:      ', len(df_cnn.index))
print('cnnibn:   ', len(df_cnnibn.index))
print('ndtv:     ', len(df_ndtv.index))
print('timesnow: ', len(df_timesnow.index))

print('\nTotal rows: ', len(df_bbc.index) + len(df_cnn.index) + len(df_cnnibn.index) + len(df_ndtv.index) + len(df_timesnow.index))

#print(df_bbc.head())
#print(df_cnn.head())
#print(df_cnnibn.head())
#print(df_ndtv.head())
#print(df_timesnow.head())

df_bbc.info()
df_cnn.info()

print('\nDONE')

bbc:       17720
cnn:       22545
cnnibn:    33117
ndtv:      17051
timesnow:  39252

Total rows:  129685


Wall time: 0 ns
Wall time: 0 ns
Wall time: 0 ns
Wall time: 0 ns
Wall time: 0 ns

bbc:       17720
cnn:       22545
cnnibn:    33117
ndtv:      17051
timesnow:  39252

Total rows:  129685
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17720 entries, 0 to 17719
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 14.0 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22545 entries, 0 to 22544
Columns: 4126 entries, 0 to 0
dtypes: float64(4126)
memory usage: 22.3 MB

DONE


## Step 9: Eliminate Unused Columns (Second Pass)

Now that the datasets have been consolidated, columns with more than X% NaN elements will be deleted to further reduce sparsity.
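
A threshold-based pass like this can be sketched with `isna().mean()`, which gives the fraction of NaN values per column. The frame and the 0.7 cut-off below are made-up values for illustration. The notebook leaves "X%" unspecified, so the threshold is an assumption to be tuned.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with columns that are 75%, 50%, and 0% NaN
df = pd.DataFrame({
    "mostly_nan": [np.nan, np.nan, np.nan, 1.0],
    "half_nan": [np.nan, 2.0, np.nan, 4.0],
    "full": [1.0, 2.0, 3.0, 4.0],
})

nan_threshold = 0.7  # assumed cut-off, not a value from the notebook

# Fraction of NaN entries in each column
nan_fraction = df.isna().mean()

# Drop the columns whose NaN fraction exceeds the threshold
cols_to_drop = nan_fraction[nan_fraction > nan_threshold].index
df_reduced = df.drop(columns=cols_to_drop)

print(nan_fraction)
print(df_reduced.columns.tolist())  # ['half_nan', 'full']
```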

In [21]:
# NOT DONE YET -- WIP

def percent_nan_columns(df):
    # Counts the columns whose fraction of NaN rows exceeds the threshold
    percent_nan_threshold = 0.9
    total_nan_cols = 0

    for i in df:
        total_nan_rows = df[i].isnull().sum()

        percent_nan_rows = total_nan_rows / len(df.index)

        if (percent_nan_rows > percent_nan_threshold):
            #print('Column: ', i, total_nan_rows, percent_nan_rows)
            total_nan_cols = total_nan_cols + 1

    return total_nan_cols

%time percent_nan_columns(df_concat)

for i in my_df_bbc:
    if i == 86:
        print(my_df_bbc[i])
        
print('\nDONE')

AttributeError: 'BlockManager' object has no attribute 'T'

NameError: name 'my_df_bbc' is not defined

# Hexagon Bin Plot
A hexagon bin plot can be created using the DataFrame.plot() function and kind = 'hexbin'. 

This kind of plot is useful when a scatter plot is too dense to interpret. It divides the chart area into hexagonal bins, and the color intensity of each hexagon indicates how concentrated the points are in that area.
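
A minimal sketch of such a plot with `DataFrame.plot(kind='hexbin')` follows. It uses randomly generated points rather than the commercial-detection features, and the column names `x` and `y`, the `gridsize`, and the colormap are illustrative assumptions. The `Agg` backend is selected only so the sketch runs without a display; inside a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumed headless run; omit inside a notebook

import numpy as np
import pandas as pd

# Hypothetical dense cloud of points that would overplot in a scatter chart
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=2000),
    "y": rng.normal(size=2000),
})

# gridsize sets the number of hexagons across the x direction
ax = df.plot(kind="hexbin", x="x", y="y", gridsize=25, cmap="viridis")
ax.set_title("Hexagon bin plot of 2,000 random points")
```

A larger `gridsize` gives smaller hexagons and finer detail, at the cost of noisier counts per bin.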

In [22]:
import plotly.plotly as py
import plotly.graph_objs as go

import numpy as np

print('\nTEST')

x = np.random.randn(500)
data = [go.Histogram(x=x)]

py.iplot(data, filename='basic histogram')

#print('\nData Frame Shape:')
#print(df_bbc.shape)

#print('\nCreating Table...')
#table = ff.create_table(df_bbc)

#print('\nPlotting...')
#py.iplot(table, filename='jupyter/table1')



TEST
Aw, snap! We don't have an account for ''. Want to try again? You can authenticate with your email address or username. Sign in is not case sensitive.

Don't have an account? plot.ly

Questions? support@plot.ly


PlotlyError: Because you didn't supply a 'file_id' in the call, we're assuming you're trying to snag a figure from a url. You supplied the url, '', we expected it to start with 'https://plot.ly'.
Run help on this function for more information.