## Data Preprocessing

This notebook summarizes the steps taken to transform the data from its original format to a format ready for analysis.

In [1]:
# import packages
import numpy as np
import pandas as pd
import re
pd.options.display.max_columns = 500

### Building the df_model dataframe

The first step is to import the dataset into a dataframe.  The `.csv` file I am importing was created through the process discussed and outlined in the `Data Wrangling` notebook for this project, which can be found in the same repository as this notebook.

In [12]:
df = pd.read_csv('nc_cases.csv', index_col=0, low_memory=False)
df.head()

Unnamed: 0,decision_date,docket_number,first_page,frontend_url,id,last_page,name,name_abbreviation,citation_type,citation,court_id,court_jurisdiction_url,court_name,court_name_abbreviation,court_slug,reporter_full_name,reporter_id,barcode,volume_number,data_attorneys,data_corrections,data_head_matter,data_judges,status,first_opinion,first_type,first_author,second_opinion,second_type,second_author,third_opinion,third_type,third_author,fourth_opinion,fourth_type,fourth_author,fifth_opinion,fifth_type,fifth_author,sixth_opinion,sixth_type,sixth_author
0,1997-04-10,No. 132P97,759,https://cite.capapi.org/nc/345/759/53834/,53834,759,STATE v. WILSON,State v. Wilson,official,345 N.C. 759,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,STATE v. WILSON\nNo. 132P97,[],ok,Notice of appeal by defendant (substantial con...,majority,,,,,,,,,,,,,,,,
1,1997-02-07,No. 358P96,342,https://cite.capapi.org/nc/345/342/53835/,53835,342,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.,In re Appeal of Camel City Laundry Co.,official,345 N.C. 342,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,IN RE APPEAL OF CAMEL CITY LAUNDRY CO.\nNo. 35...,[],ok,Petition by petitioner for discretionary revie...,majority,,,,,,,,,,,,,,,,
2,1997-04-10,No. 93P97,752,https://cite.capapi.org/nc/345/752/53836/,53836,752,GILLIAM v. FIRST UNION NAT. BANK,Gilliam v. First Union Nat. Bank,official,345 N.C. 752,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,GILLIAM v. FIRST UNION NAT. BANK\nNo. 93P97,[],ok,Petition by plaintiff for discretionary review...,majority,,,,,,,,,,,,,,,,
3,1997-01-15,No. 9A94-2,348,https://cite.capapi.org/nc/345/348/53837/,53837,348,STATE v. ATKINS,State v. Atkins,official,345 N.C. 348,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,STATE v. ATKINS\nNo. 9A94-2,[],ok,Petition by defendant for writ of certiorari t...,majority,,,,,,,,,,,,,,,,
4,1997-02-07,No. 345P96,344,https://cite.capapi.org/nc/345/344/53838/,53838,344,MILLER v. BROOKS,Miller v. Brooks,official,345 N.C. 344,9292,,Supreme Court of North Carolina,N.C.,nc,North Carolina Reports,549,32044049256738,345,[],,MILLER v. BROOKS\nNo. 345P96,[],ok,Petition by defendants for discretionary revie...,majority,,,,,,,,,,,,,,,,


I will next create a new dataframe, `df_model`, which I will ultimately use as the dataset for exploratory data analysis and for building my model.  I do not need all of this data for my model, so I will only be adding the unique identification number, casename, and date as a starting point for the `df_model` dataframe.  

In [13]:
df_model = pd.DataFrame({'id': df.loc[:,'id'], 'casename': df.loc[:,'name_abbreviation'], 'date': df.loc[:,'decision_date']})
df_model.head()

Unnamed: 0,id,casename,date
0,53834,State v. Wilson,1997-04-10
1,53835,In re Appeal of Camel City Laundry Co.,1997-02-07
2,53836,Gilliam v. First Union Nat. Bank,1997-04-10
3,53837,State v. Atkins,1997-01-15
4,53838,Miller v. Brooks,1997-02-07


Now that I have added the pertinent identifier information for each case, I can add the case opinions and authoring judges that will be the focus of my analysis.  Each case had anywhere from one to six written opinions, which were stored in separate columns of the same case row in the original `.json` file.  For the purpose of my analysis, however, I need each opinion in its own row of the `df_model` dataframe.  I will start by adding the first opinion for each case because I know that each case in the original dataset has at least one written opinion:

In [14]:
df_model['judge'] = df.loc[:, 'first_author']
df_model['type'] = df.loc[:, 'first_type']
df_model['opinion'] = df.loc[:, 'first_opinion']
df_model.head()

Unnamed: 0,id,casename,date,judge,type,opinion
0,53834,State v. Wilson,1997-04-10,,majority,Notice of appeal by defendant (substantial con...
1,53835,In re Appeal of Camel City Laundry Co.,1997-02-07,,majority,Petition by petitioner for discretionary revie...
2,53836,Gilliam v. First Union Nat. Bank,1997-04-10,,majority,Petition by plaintiff for discretionary review...
3,53837,State v. Atkins,1997-01-15,,majority,Petition by defendant for writ of certiorari t...
4,53838,Miller v. Brooks,1997-02-07,,majority,Petition by defendants for discretionary revie...


Next, I will add the additional opinions for each case as separate rows in the `df_model` dataframe.  Not every case has multiple opinions, so I will use the `isna()` method to isolate the cases that have two or more opinions.  I will store each group of opinions temporarily in its own dataframe and then append them all to `df_model` using `pd.concat()`:

In [15]:
# save each group of opinions temporarily into separate dataframes,
# keeping only the cases where that opinion slot is populated

df_second = df[~df['second_opinion'].isna()]
df_second = df_second[['id', 'name_abbreviation', 'decision_date', 'second_opinion', 'second_type', 'second_author']]
df_second.columns = ['id', 'casename', 'date', 'opinion', 'type', 'judge']

df_third = df[~df['third_opinion'].isna()]
df_third = df_third[['id', 'name_abbreviation', 'decision_date', 'third_opinion', 'third_type', 'third_author']]
df_third.columns = ['id', 'casename', 'date', 'opinion', 'type', 'judge']

df_fourth = df[~df['fourth_opinion'].isna()]
df_fourth = df_fourth[['id', 'name_abbreviation', 'decision_date', 'fourth_opinion', 'fourth_type', 'fourth_author']]
df_fourth.columns = ['id', 'casename', 'date', 'opinion', 'type', 'judge']

df_fifth = df[~df['fifth_opinion'].isna()]
df_fifth = df_fifth[['id', 'name_abbreviation', 'decision_date', 'fifth_opinion', 'fifth_type', 'fifth_author']]
df_fifth.columns = ['id', 'casename', 'date', 'opinion', 'type', 'judge']

df_sixth = df[~df['sixth_opinion'].isna()]
df_sixth = df_sixth[['id', 'name_abbreviation', 'decision_date', 'sixth_opinion', 'sixth_type', 'sixth_author']]
df_sixth.columns = ['id', 'casename', 'date', 'opinion', 'type', 'judge']
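
The per-ordinal blocks above differ only in the column prefix, so the same reshaping could also be written as a loop. The following is a minimal sketch on a toy dataframe with made-up values (the `wide`, `frames`, and `long_df` names are illustrative, not part of the notebook). Unlike the notebook, it also applies the not-null filter to the first opinion.

```python
import pandas as pd

# Toy stand-in for the wide case table (all values are made up)
wide = pd.DataFrame({
    'id': [1, 2],
    'name_abbreviation': ['A v. B', 'C v. D'],
    'decision_date': ['1990-01-01', '1991-02-02'],
    'first_opinion': ['text a1', 'text c1'],
    'first_type': ['majority', 'majority'],
    'first_author': ['SMITH, Judge.', 'JONES, Judge.'],
    'second_opinion': ['text a2', None],
    'second_type': ['dissent', None],
    'second_author': ['DOE, Judge.', None],
})

frames = []
for ordinal in ['first', 'second']:
    cols = ['id', 'name_abbreviation', 'decision_date',
            f'{ordinal}_opinion', f'{ordinal}_type', f'{ordinal}_author']
    # keep only the cases that actually have an opinion in this slot
    part = wide.loc[wide[f'{ordinal}_opinion'].notna(), cols].copy()
    part.columns = ['id', 'casename', 'date', 'opinion', 'type', 'judge']
    frames.append(part)

# one row per opinion; a case with two opinions appears twice
long_df = pd.concat(frames, axis=0, ignore_index=True)
print(long_df[['id', 'type', 'judge']])
```

With the real data, the list of ordinals would run from `'first'` through `'sixth'`.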

In [16]:
# add all of the rows in the temporary dataframes into df_model
# sort=True keeps the alphabetical column order shown in the output below
df_model = pd.concat([df_model, df_second, df_third, df_fourth, df_fifth, df_sixth], axis=0, ignore_index=True, sort=True)

In [17]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104208 entries, 0 to 104207
Data columns (total 6 columns):
casename    104208 non-null object
date        104208 non-null object
id          104208 non-null int64
judge       84398 non-null object
opinion     104194 non-null object
type        104194 non-null object
dtypes: int64(1), object(5)
memory usage: 4.8+ MB


In [18]:
# QC check to make sure that cases with multiple opinions have multiple rows in the df_model dataframe
df_model['id'].value_counts()[:13]

11270334    6
8653789     5
8612587     5
8557798     5
8566644     5
8570287     5
8623087     5
8649463     5
8629717     5
8630764     5
8562214     5
11271899    4
8661434     4
Name: id, dtype: int64

As the output above shows, the `df_model` dataframe has 104,208 rows in total, and cases with multiple opinions appear as multiple rows sharing the same identification number, which is consistent with the dataframe having been built correctly.  I will now save this dataframe as a local `.csv` file:

In [14]:
# to_csv() returns None when given a path, so there is nothing to assign
df_model.to_csv('df_model.csv')

### Removing the cases without written decisions

As the `.info()` output above shows, the `df_model` dataframe has 104,208 entries in total, but only 84,398 of them list an authoring judge.  The remaining 19,810 entries have no judge information because they were procedural events where the court did not issue a substantive opinion (e.g., a denial of a petition for appeal).  I will remove these from the dataframe as they are not pertinent to the goal of this project, which is to identify authorship based on the substance of a written court opinion:

In [19]:
df_model = df_model.dropna().reset_index()
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84398 entries, 0 to 84397
Data columns (total 7 columns):
index       84398 non-null int64
casename    84398 non-null object
date        84398 non-null object
id          84398 non-null int64
judge       84398 non-null object
opinion     84398 non-null object
type        84398 non-null object
dtypes: int64(2), object(5)
memory usage: 4.5+ MB
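
Two details of the cell above are worth noting: `dropna()` with no arguments drops a row if any column is missing, and `reset_index()` keeps the old index as a new `index` column, which is why the output now lists 7 columns. A narrower variant, sketched here on a toy dataframe with made-up values, checks only the `judge` column and discards the old index. This is an assumed alternative for illustration, not what the notebook ran.

```python
import pandas as pd

# Toy data (made up): the second entry is procedural and has no authoring judge
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'judge': ['SMITH, Judge.', None, 'Per Curiam.'],
    'opinion': ['text 1', 'Petition denied.', 'text 3'],
})

# Drop only rows lacking a judge; drop=True avoids adding an 'index' column
with_judge = toy.dropna(subset=['judge']).reset_index(drop=True)
print(with_judge)
```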


### Cleaning the judge authorship data

The name of the judge who authored each opinion is listed in the `judge` column of the dataframe.  Unfortunately, this data was not recorded using any type of judge identification number, but rather via a text OCR process that resulted in numerous inaccurate spellings.  To compound the issue, no consistent naming convention was used.  Running `value_counts()` on the `judge` column highlights these issues:

In [20]:
df_model['judge'].value_counts()

Per Curiam.                             2190
PER CURIAM.                             1427
Stacy, C. J.                            1054
HEDRICK, Judge.                         1013
ARNOLD, Judge.                           972
PARKER, Judge.                           936
GREENE, Judge.                           878
WELLS, Judge.                            837
Smith, C. J.                             836
VAUGHN, Judge.                           768
Adams, J.                                743
JOHNSON, Judge.                          739
WYNN, Judge.                             723
EAGLES, Judge.                           703
Clark, C. J.                             695
MARTIN, Judge.                           658
MORRIS, Judge.                           613
Battle, J.                               610
BRITT, Judge.                            603
McGEE, Judge.                            587
Walker, J.,                              576
Bobbitt, J.                              575
LEWIS, Jud

Not pretty.  As a first step, I will use a chain of regular expressions to strip everything but the judge's last name:

In [21]:
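# lowercase, protect "per curiam", then strip initials and titles
# (regex=True is explicit because newer pandas versions default to literal matching)
df_model['judge'] = (
    df_model['judge'].str.lower()
    .str.replace('per curiam', 'per_curiam', regex=False)
    .str.replace(r' [a-z]\.', '', regex=True)
    .str.replace('judge', '', regex=False)
    .str.replace('justice', '', regex=False)
    .str.replace('chief', '', regex=False)
)
# strip leading spaces, punctuation, digits, and any remaining spaces
df_model['judge'] = (
    df_model['judge']
    .str.replace(r'^ +', '', regex=True)
    .str.replace(r'[,.:"\';-]', '', regex=True)
    .str.replace(r'[0-9]', '', regex=True)
    .str.replace(' ', '', regex=False)
)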
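
To see what these substitutions do to individual values, the same steps can be replayed with Python's `re` module on a few raw names taken from the earlier `value_counts()` output. The helper name `clean_judge` is hypothetical, and this is only a string-level sketch of the chain, not the pandas code itself.

```python
import re

def clean_judge(raw):
    """Replay the notebook's replacement chain on a single string."""
    s = raw.lower().replace('per curiam', 'per_curiam')
    s = re.sub(r' [a-z]\.', '', s)        # drop initials such as " c." or " j."
    for title in ('judge', 'justice', 'chief'):
        s = s.replace(title, '')          # drop titles
    s = re.sub(r'^ +', '', s)             # leading spaces
    s = re.sub(r'[,.:"\';-]', '', s)      # punctuation
    s = re.sub(r'[0-9]', '', s)           # digits
    return s.replace(' ', '')             # remaining spaces

for raw in ['Stacy, C. J.', 'HEDRICK, Judge.', 'PER CURIAM.', 'Walker, J.,']:
    print(repr(raw), '->', clean_judge(raw))
```

Each of the four inputs reduces to a bare lowercase token (`stacy`, `hedrick`, `per_curiam`, `walker`), which is why the differently formatted variants collapse into single entries in the counts that follow.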

In [22]:
df_model['judge'].value_counts()

per_curiam            4488
clark                 2504
stacy                 1822
parker                1819
walker                1539
smith                 1517
hedrick               1513
martin                1371
greene                1179
connor                1174
arnold                1167
johnson               1138
ruffin                1120
hoke                  1028
vaughn                1027
morris                1008
wells                  948
wynn                   934
eagles                 919
britt                  875
pearson                831
adams                  807
webb                   800
brock                  796
devin                  789
bobbitt                760
rodman                 752
denny                  723
allen                  715
higgins                708
                      ... 
bhogdbn                  1
fttrcbes                 1
denotj                   1
eurfirr                  1
rutfxn                   1
johnsontaylormacay       1
e

While the regular expressions certainly helped, the data still contains 4,718 distinct judge names, substantially more than expected.  The major culprit is inaccurate OCR that produced misspellings.  There is unfortunately no quick fix for this issue, and correcting every spelling error was not practical given my time constraints for this project.  I therefore focused on names that appeared in the dataset at least 3 times and that I could identify (with some cursory googling of North Carolina judges), and consolidated them under the correct spelling using the `str.replace()` method.  The one- or two-off misspellings were dropped from the dataset.  Here are the results:

In [34]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80749 entries, 0 to 80748
Data columns (total 9 columns):
index          80749 non-null int64
casename       80749 non-null object
date           80748 non-null object
id             80749 non-null int64
judge          80749 non-null object
opinion        80749 non-null object
type           80749 non-null object
judge_count    80749 non-null int64
year           80748 non-null float64
dtypes: float64(1), int64(3), object(5)
memory usage: 6.2+ MB


In [31]:
df_model['judge'].value_counts()

per_curiam         6440
clark              3625
parker             2069
walker             1979
stacy              1936
smith              1625
pearson            1564
hedrick            1541
connor             1447
martin             1378
clarkson           1294
hoke               1251
ruffin             1247
brown              1243
greene             1184
arnold             1167
johnson            1162
morris             1040
vaughn             1035
hunter              965
wells               948
wynn                934
winborne            925
rodman              924
merrimon            921
britt               921
eagles              919
denny               902
battle              885
adams               857
                   ... 
jones                45
ham                  42
valentine            41
doderidge            41
arrowood             40
locke                39
varser               34
cameron              33
johnston             31
haywood              25
butterfield     

Through this iterative process, I was able to reduce the total number of judges in the dataset to just over 150, a much more manageable and accurate number.  I ended up dropping 3,649 rows (the one- or two-off OCR errors in the judge name), or 4.3% of the 84,398 rows.  I concluded that this was a worthwhile trade-off given the additional time it would have taken to correct those misspellings one or two rows at a time.
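
The consolidation cells themselves are not shown in this notebook. As a minimal sketch of the idea on toy data, assuming a hand-built correction map and a per-name row count (the `judge_count` column in the `.info()` output above suggests something similar, but how it was computed is an assumption here), the step might look like the following. The misspelling `clabk` and the `corrections` dict are hypothetical, and the sketch swaps in `Series.replace` with a dict for exact-match corrections rather than `str.replace()`.

```python
import pandas as pd

# Toy judge column: one recognizable misspelling and one unidentifiable one-off
toy = pd.DataFrame({'judge': ['clark', 'clark', 'clabk', 'clabk', 'clabk',
                              'stacy', 'stacy', 'stacy', 'rutfxn']})

# Hypothetical map from an identified misspelling to the correct name
corrections = {'clabk': 'clark'}
toy['judge'] = toy['judge'].replace(corrections)

# Count rows per name, then drop names that still appear fewer than 3 times
toy['judge_count'] = toy['judge'].map(toy['judge'].value_counts())
kept = toy[toy['judge_count'] >= 3].reset_index(drop=True)

print(kept['judge'].value_counts())
```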

### Preparing the opinion text

The final column of data to address is the `opinion` column, which contains the narrative text of each court decision.  Let's take a look at an example opinion to see whether it contains any glaring issues that I should address before vectorizing the text into a document-term matrix:

In [9]:
df_model.loc[240, 'opinion']

'BUTTERFIELD, Justice.\nOn 2 February 1998, defendant was indicted for first-degree murder and for robbery with a dangerous weapon. Defendant was tried capitally before a jury at the 12 October 1998 Criminal Session of Superior Court, New Hanover County. The jury found defendant guilty of first-degree murder on the basis of premeditation and deliberation and under the felony murder rule. The jury also found defendant guilty of robbery with a dangerous weapon. Following a capital sentencing proceeding, the jury recommended a sentence of death for the first-degree murder conviction. On 23 October 1998, the trial court sentenced defendant to death. The trial court also sentenced defendant to a consecutive minimum sentence of 103 months’ imprisonment and a maximum of 133 months’ imprisonment for the robbery conviction. Defendant appealed his sentence of death for first-degree murder to this Court as of right. On 24 February 2000, this Court allowed defendant’s motion to bypass the Court of

The text appears to be in pretty good shape for this stage of the project.  I do not want to remove numerical information or citations, as those characteristics may be relevant to identifying which judge wrote an opinion.  It would not surprise me at all for certain judges to cite to legal authority more often than others, for example.

The only troubling issue is that the judge's name often appears at the very start of the opinion (as in `BUTTERFIELD, Justice.` above).  I do not want that information in the text of the opinion, as it is likely to skew the results of the model by revealing the author directly.  I can address this issue through the `stop_words` parameter of the vectorizer.  I will first build a list of the judges' names and various other terms to include as stop words:

In [2]:
df_model = pd.read_csv('df_model_4.csv', index_col=0)

In [3]:
# create list of judges names
stop_words = list(set(df_model['judge']))
print(sorted(stop_words))
print(type(stop_words))

['adams', 'allen', 'arnold', 'arrowood', 'ashe', 'avery', 'baley', 'barnhill', 'battle', 'beasley', 'becton', 'biggs', 'billings', 'bobbitt', 'boyden', 'brady', 'branch', 'braswell', 'britt', 'brock', 'brogden', 'brown', 'bryant', 'burwell', 'butterfield', 'bynum', 'calabria', 'cameron', 'campbell', 'carlton', 'carson', 'chase', 'clark_edward', 'clark_walter', 'clarkson', 'colcock', 'connor', 'cook', 'cooke', 'copeland', 'cozort', 'crew', 'daniel', 'davis', 'denny', 'devin', 'dick', 'dillard', 'dillon', 'doderidge', 'doderidge_jones', 'douglas', 'duncan', 'eagles', 'edmunds', 'elmore', 'ervin', 'erwin', 'evans', 'exum', 'faircloth', 'freeman', 'frye', 'fuller', 'furches', 'gaston', 'geer', 'grahall', 'greene', 'hall', 'haywood', 'hedrick', 'henderson', 'higgins', 'hill', 'hoke', 'horton', 'hudson', 'hunter', 'huskins', 'jackson', 'john', 'johnson_clifton', 'johnson_jefferson', 'johnston', 'johnston_macay', 'jones', 'lake', 'levinson', 'lewis', 'locke', 'lowrie', 'macrae', 'mallard', 'm

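I need to add a few more terms to make this list complete.  Some judge names were given underscores during cleaning (for example, to tell apart judges who share a last name, as in `clark_edward` and `clark_walter`), and those underscored forms will not match the plain names that appear in the opinion text, so I will add the plain names back.  I will also add generic terms like 'judge', 'justice', 'per', and 'curiam':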

In [4]:
add_stop_words = ['clark', 'johnson', 'martin', 'parker', 'timmons', 'goodson', 'walker', 'judge', 'justice', 'per', 'curiam']
stop_words += add_stop_words
print(sorted(stop_words))

['adams', 'allen', 'arnold', 'arrowood', 'ashe', 'avery', 'baley', 'barnhill', 'battle', 'beasley', 'becton', 'biggs', 'billings', 'bobbitt', 'boyden', 'brady', 'branch', 'braswell', 'britt', 'brock', 'brogden', 'brown', 'bryant', 'burwell', 'butterfield', 'bynum', 'calabria', 'cameron', 'campbell', 'carlton', 'carson', 'chase', 'clark', 'clark_edward', 'clark_walter', 'clarkson', 'colcock', 'connor', 'cook', 'cooke', 'copeland', 'cozort', 'crew', 'curiam', 'daniel', 'davis', 'denny', 'devin', 'dick', 'dillard', 'dillon', 'doderidge', 'doderidge_jones', 'douglas', 'duncan', 'eagles', 'edmunds', 'elmore', 'ervin', 'erwin', 'evans', 'exum', 'faircloth', 'freeman', 'frye', 'fuller', 'furches', 'gaston', 'geer', 'goodson', 'grahall', 'greene', 'hall', 'haywood', 'hedrick', 'henderson', 'higgins', 'hill', 'hoke', 'horton', 'hudson', 'hunter', 'huskins', 'jackson', 'john', 'johnson', 'johnson_clifton', 'johnson_jefferson', 'johnston', 'johnston_macay', 'jones', 'judge', 'justice', 'lake', 'l
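
To illustrate why the plain last names are needed, the sketch below emulates the vectorizer's default behavior with the standard library only: lowercasing, tokenizing with scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`), and then discarding stop words. The helper name and the sample sentence are made up for illustration, and this is an emulation rather than a call into scikit-learn itself.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers: tokens of 2+ word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokens_after_stop_words(text, stop_words):
    """Lowercase, tokenize, and filter stop words, as the vectorizer does by default."""
    stop = set(stop_words)
    return [t for t in TOKEN.findall(text.lower()) if t not in stop]

text = "CLARK, C. J. The defendant appealed."  # made-up opening line

# The underscored label never occurs in the text, so 'clark' survives
only_label = tokens_after_stop_words(text, ['clark_walter'])
# Adding the plain last name removes it
with_plain = tokens_after_stop_words(text, ['clark_walter', 'clark'])

print(only_label)
print(with_plain)
```

The single-letter initials disappear in both cases because the default pattern only keeps tokens of two or more characters.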

### Creating a document-term matrix

As a final step, I will convert the opinion text into a document-term matrix using `TfidfVectorizer`.  Feeding the entire corpus into the vectorizer with no restrictions other than the judge-name stop words discussed above produced a vocabulary of 212,423 words.  This was much larger than expected and is likely the result of OCR mistakes and other one-off words.  I therefore used the `min_df` parameter, which drops terms that appear in fewer than the given fraction of documents, to produce more manageable and relevant vocabularies.  Here are the results:

| `min_df` | Vocabulary size |
|:--------:|:---------------:|
| 0.02     | 3,677           |
| 0.001    | 16,905          |
| 0.0005   | 22,811          |
| 0.0001   | 44,088          |
| 0.00005  | 57,795          |
| 0.00     | 212,423         |


A `min_df` value of 0.02 produces a vocabulary of only 3,677 words.  I will likely start with this vocabulary to see how my model performs and then incrementally increase the vocabulary size to track how it affects the model's performance.  The cell below uses `min_df=0.001`, which yields the 16,905-term matrix.
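
To make these thresholds concrete: when `min_df` is a float, scikit-learn treats it as a proportion of documents, so a term is kept only if it appears in at least `min_df * n_docs` documents. The arithmetic below uses the 80,749 opinions in `df_model`; the variable names are illustrative.

```python
import math

n_docs = 80749  # rows in df_model after cleaning

min_docs = {}
for min_df in [0.02, 0.001, 0.0005, 0.00005]:
    # smallest whole number of documents satisfying df >= min_df * n_docs
    min_docs[min_df] = math.ceil(min_df * n_docs)
    print(f"min_df={min_df}: term must appear in at least {min_docs[min_df]} opinions")
```

At `min_df=0.02` a word has to occur in 1,615 or more opinions to be kept, which explains why that setting trims the vocabulary so sharply.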


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words, min_df=0.001)
data_tfidf = tfidf.fit_transform(df_model['opinion'])
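# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
df_data = pd.DataFrame(data_tfidf.toarray(), columns=tfidf.get_feature_names_out())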
df_data.index = df_model.index

df_data.head()


Unnamed: 0,00,000,01,02,03,04,05,050,06,07,08,09,0f,10,100,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,101,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,102,1020,1021,1022,1023,1024,1025,1026,1027,1028,1029,103,1030,1031,1032,1033,1035,1036,1037,1038,1039,104,1040,1041,1042,1043,1044,1045,1046,1047,1048,1049,105,1050,1051,1052,1053,1054,1055,1056,1057,1059,106,1060,1061,1062,1063,1064,1065,1066,1067,1068,1069,107,1070,1072,1073,1074,1075,1076,1077,1078,1079,108,1080,1081,1082,1083,1084,1086,1087,1089,108a,109,1090,1091,1092,1093,1094,1095,1096,1097,1098,1099,10a,10th,11,110,1100,1101,1102,1103,1104,1105,1106,1107,1108,1109,111,1110,1111,1112,1113,1114,1117,112,1120,1122,1123,1124,1125,1126,1127,1128,1129,113,1130,1131,1133,1134,1135,1137,1138,1139,113a,114,1143,1145,1147,1148,1149,115,1152,1154,1156,1159,115c,116,1160,1161,1164,1165,1167,1169,117,1178,118,1180,1181,1183,1189,119,1194,1197,1199,11th,12,120,1200,1201,1205,1206,1207,1208,1209,121,1210,1211,1212,1213,1214,1215,122,1221,1222,1225,1226,1227,1228,1229,122c,123,1230,1231,1232,1233,1234,1235,1236,1237,124,1240,1241,1242,1243,1245,1246,1247,1249,125,1250,1251,1253,1254,126,127,128,1283,129,12th,13,130,1300,1302,...,willingness,willis,williston,willoughby,wills,wilmington,wilson,win,winbobne,winboene,winchester,wind,winders,windfall,winding,windley,window,windows,winds,windshield,wine,winfield,wing,wingo,winn,winner,winning,winslow,winstead,winston,winter,winters,wipe,wiped,wire,wired,wires,wiring,wis,wisconsin,wisdom,wise,wisely,wiseman,wiser,wisest,wish,wished,wishes,wishing,wit,witb,witbin,with,withdraw,withdrawal,withdrawals,withdrawing,withdrawn,withdraws,withdrew,withers,witherspoon,withheld,withhold,withholding,withholds,within,without,withstand,witness,witnessed,witnesses,witnessing,witt,wives,wjhere,wjhether,wl,wm,wms,wo,woke,wolf,wolfe,womack,woman,womble,women,won,wonder,wong,wood,woodard,wooded,wooden,woodfin,woodhouse,woodland,woodlief,woodmen,woodruff,woods,woodson,woodward,woody,wool,w
oolard,wooten,word,worded,wording,words,wore,work,workable,worked,worker,workers,working,workman,workmanlike,workmanship,workmen,workplace,works,worksheet,worland,world,worley,worn,worried,worry,worse,worsened,worship,worsley,worst,worth,wortham,worthington,worthless,worthy,would,wouldn,wound,wounded,wounding,wounds,wrapped,wras,wray,wreck,wrecked,wrecker,wrenn,wright,wrightsville,wrist,wrists,writ,write,writer,writers,writes,writing,writings,writs,written,wrong,wrongdoer,wrongdoers,wrongdoing,wronged,wrongful,wrongfully,wrongly,wrongs,wrote,wrought,wyatt,wynne,wyo,wás,wé,xi,xii,xiii,xiv,ya,yadkin,yale,yance,yancey,yarborough,yarbrough,yard,yards,yarn,yates,ye,yeah,year,yearly,years,yell,yelled,yelling,yellow,yelverton,yes,yesterday,yet,yield,yielded,yielding,yields,yii,yirginia,yol,yon,york,you,young,youngblood,younger,youngest,yount,your,yours,yourself,yourselves,youth,youthful,yow,zachary,zeal,zealous,zero,zimmerman,zone,zoned,zones,zoning,zuniga,zurich,ánd,áre,ás,óf
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005499,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015969,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015589,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01532,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005415,0.006923,0.0,0.0,0.0,0.008374,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.
0,0.0
2,0.002282,0.016448,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002893,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005468,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002046,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002775,0.0,0.0,0.0,0.0,0.0,0.0,0.002841,0.0,0.0,0.0,0.0,0.001943,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002882,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025303,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.008406,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017213,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011109,0.0,0.0,0.0,0.0,0.0,0.045348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002829,0.007232,0.0,0.021621,0.0,0.017496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006554,0.0,0.0,0.0,0.003421,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.007688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00211,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00211,0.0,0.003594,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010542,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.003825,0.0,0.0,0.006578,0.0,0.006501,0.0,0.0,0.0,0.0,0.0,0.0,0.008856,0.003916,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008488,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008987,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004574,0.0,0.0,0.0,0.003433,0.013371,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0304,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004605,0.017661,0.0,0.0,0.0,0.003561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015604,0.0,0.018257,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005531,0.0,0.0,0.0,0.0,0.005981,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003435,0.0,0.011704,0.0,0.0,0.0,0.0,0.0,0.004504,0.0,0.0,0.0,0.0,0.0,0.
0,0.0,0.0,0.0,0.0,0.0,0.034325,0.004844,0.0,0.0,0.0,0.0,0.004159,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.020194,0.0,0.0,0.019954,0.0,0.0,0.0,0.0,0.020754,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013527,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013864,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013329,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007067,0.018068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015615,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80749 entries, 0 to 80748
Columns: 16905 entries, 00 to óf
dtypes: float64(16905)
memory usage: 10.2 GB
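
The 10.2 GB figure follows directly from the shape: `.toarray()` stores every cell as a `float64`, including the zeros that dominate the preview above. A quick check of the arithmetic, assuming pandas reports memory in binary units (multiples of 1024):

```python
n_rows, n_cols = 80749, 16905   # shape reported by df_data.info()
bytes_per_value = 8             # float64

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gib = dense_bytes / 1024**3
print(f"dense float64 matrix: {dense_gib:.1f} GiB")
```

`fit_transform` itself returns a SciPy sparse matrix, so downstream steps that accept sparse input could presumably skip the dense conversion; whether that suits the later modeling is not covered here.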


Before building the model, I will conduct some exploratory data analysis in the next notebook.