# MachineHack - Classifying Movie Scripts : Predict The Movie Genre

## Leaderboard - 19th rank

Labeling text data can be hard. To use the available information to auto-create or predict the labels can be an interesting machine learning task. Using the power of Natural Language Processing (NLP) the unstructured text data can be leveraged to auto-generate the right classes for the test data in the future.

In order to accomplish this, we have scraped close to 2000 movie scripts and the respective genres.

As some of the scripts are huge, it would be interesting to figure out new ways of feature extraction and different NLP techniques.

In this hackathon participants are challenged to use the movie script to design a Natural language processing system that can help the customer classify it into the right genre in the coming future.

The current platform struggles to classify the movies with an accuracy above 90%. However, we at MachineHack feel that the current state of the art NLP algorithms such as BERT and OpenGPT have paved the way to design more robust systems which can understand the context of the provided text data.
Data Description

The unzipped folder will have the following files.

    Train.csv – 1978 script file names with the class labels.
    Test.csv – 849 script file names without the class labels.
    Scripts – Folder with 2827 scripts .txt files.
    Sample Submission – Sample format for the submission.
    Started Notebook – A simple benchmark notebook.


In [1]:
from gensim.models import keyedvectors
import os, re, string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gc
gc.collect()

17

In [2]:
print(os.listdir())

['.ipynb_checkpoints', 'answer.xlsx', 'Classify_Movie.ipynb', 'GoogleNews-vectors-negative300.bin', 'GoogleNews-vectors-negative300.bin.gz', 'Imp_file', 'Imp_file_2', 'Imp_file_3', 'Movie_Scripts_Sample_Submission.xlsx', 'Scripts', 'Test.csv', 'Train.csv', 'wiki-news-300d-1M-subword.vec', 'wiki-news-300d-1M-subword.vec.zip']


In [3]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
sample = pd.read_excel('Movie_Scripts_Sample_Submission.xlsx')

In [4]:
sample.head()

Unnamed: 0,File_Name,0,1,2,3,4,5,6,7,8,...,12,13,14,15,16,17,18,19,20,21
0,file_2300.txt,0.254897,0.07087,0.034026,0.001088,0.076833,0.055085,0.108631,0.014625,0.040442,...,0.002126,0.007801,0.031488,0.047958,0.081665,0.001089,0.000549,0.103292,0.009705,0.004852
1,file_809.txt,0.072761,0.042652,0.011684,0.001046,0.128757,0.067255,0.234022,0.01294,0.034502,...,0.002092,0.007767,0.088627,0.070536,0.04284,0.001048,0.000524,0.114109,0.008936,0.004475
2,file_1383.txt,0.176885,0.055458,0.011611,0.00098,0.082053,0.046127,0.297048,0.011906,0.036915,...,0.001932,0.007315,0.030388,0.06044,0.041038,0.000975,0.000483,0.079213,0.009655,0.004332
3,file_983.txt,0.077208,0.040396,0.010318,0.00094,0.110868,0.069286,0.193717,0.01127,0.031018,...,0.001883,0.00681,0.031463,0.055402,0.040821,0.000947,0.000475,0.26013,0.008493,0.004113
4,file_1713.txt,0.108292,0.04444,0.010759,0.000963,0.081152,0.073162,0.174912,0.011675,0.032233,...,0.001917,0.007004,0.032439,0.050234,0.046141,0.000974,0.000484,0.26462,0.009009,0.004406


In [5]:
train.head()

Unnamed: 0,File_Name,Labels
0,file_2180.txt,8
1,file_693.txt,4
2,file_2469.txt,6
3,file_2542.txt,6
4,file_378.txt,16


In [6]:
test.head()

Unnamed: 0,File_Name
0,file_2300.txt
1,file_809.txt
2,file_1383.txt
3,file_983.txt
4,file_1713.txt


In [7]:
%%time
print(os.getcwd())

from nltk.corpus import PlaintextCorpusReader
script = PlaintextCorpusReader('Scripts', '.*')

files = {}
raw = {}
for fileids in script.fileids() :
    files[fileids] = script.raw(fileids)

C:\Users\Rahul\Desktop\Titu
Wall time: 1min 1s


In [8]:
df = train.append(test,sort=False)
df.reset_index(drop = True, inplace = True)

In [9]:
df['script'] = df.File_Name.apply(lambda x: files[x])

In [10]:
df.head()

Unnamed: 0,File_Name,Labels,script
0,file_2180.txt,8.0,"\t\t\tCrouching Tiger, Hidden Dragon\n\n\t\t\t..."
1,file_693.txt,4.0,"""MUMFO..."
2,file_2469.txt,6.0,MAX PAYNE\n\n ...
3,file_2542.txt,6.0,SLUMDOG MILLIONAIRE\n\n ...
4,file_378.txt,16.0,<b><!--\n\n</b>if (window!= top)\n\ntop.locati...


# 1. BASIC CLEANING

In [59]:
%%time
df.script = df.script.apply(lambda x: re.sub(r"\n+"," ",x))
df.script = df.script.apply(lambda x: re.sub(r"\t+"," ",x))
df.script = df.script.apply(lambda x: re.sub(r"<b>.+</b>"," ",x))
df.script = df.script.apply(lambda x: re.sub(r"/+"," ",x))
import tqdm
for c in tqdm.tqdm(string.punctuation):
    if c not in ['\\','^']:
        df.script = df.script.apply(lambda x: re.sub(r"[{}]+".format(c),c,x))

df.script = df.script.apply(lambda x: re.sub(r"\d+"," ",x))
df.script = df.script.apply(lambda x: re.sub(r"\s{2,}"," ",x))

100%|██████████████████████████████████████████████████████████████████████████████████| 32/32 [07:07<00:00, 13.35s/it]


Wall time: 8min 43s


In [63]:
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "He had",
"he'd've": "He would have",
"he'll": "He will",
"he'll've": "He will have",
"he's": "He is",
"how'd": "How did",
"how'd'y": "How do you",
"how'll": "How will",
"how's": "How is",
"i'd": "I had",
"i'd've": "I would have",
"i'll": "I will",
"i'll've": "I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "It had",
"it'd've": "It would have",
"it'll": "It will",
"it'll've": "It will have",
"it's": "It is",
".it's": "It is",
"let's": "Let us",
"ma'am": "Madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "She had",
"she'd've": "She would have",
"she'll": "She will",
"she'll've": "She will have",
"she's": "She is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that had",
"that'd've": "that would have",
"that's": "that is",
"there'd": "There had",
"there'd've": "There would have",
"there's": "There has",
"they'd": "They had",
"they'd've": "They would have",
"they'll": "They will",
"they'll've": "They will have",
"they're": "They are",
"they've": "They have",
"to've": "to have",
"wasn't": "was not",
"we'd": "We had",
"we'd've": "We would have",
"we'll": "We will",
"we'll've": "We will have",
"we're": "We are",
"we've": "We have",
"weren't": "were not",
"what'll": "What will",
"what'll've": "What will have",
"what're": "What are",
"what's": "What is",
"what've": "What have",
"when's": "When is",
"when've": "When have",
"where'd": "Where did",
"where's": "Where is",
"where've": "Where have",
"who'll": "Who will",
"who'll've": "Who will have",
"who's": "Who is",
"who've": "Who have",
"why's": "Why is",
"why've": "Why have",
"will've": "ill have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "You all",
"y'all'd": "You all would",
"y'all'd've": "You all would have",
"y'all're": "You all are",
"y'all've": "You all have",
"you'd": "You had",
"you'd've": "You would have",
"you'll": "You will",
"you'll've": "You will have",
"you're": "You are",
"you've": "You have"
}
df.script = df.script.apply(lambda x: ' '.join([contractions[x.lower()] if x.lower() in contractions.keys() else x 
                                                for x in x.split()]))

In [72]:
df.script = df.script.apply(lambda x: re.sub(r"'s?S?","",x))

# 2. Lemamtize

In [77]:
%%time
import nltk, time
print(time.ctime())
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet as wn

lem = WordNetLemmatizer()
pos = {'R':wn.ADV, 'J':wn.ADJ, 'N':wn.NOUN, 'V':wn.VERB}

def func(x):
    all = []
    for s in nltk.sent_tokenize(x):
        all.extend([lem.lemmatize(w, pos = pos.get(t[0], wn.NOUN)) 
                    for w,t in nltk.pos_tag(nltk.word_tokenize(s))])
    return(all)
    
df.script = df.script.apply(lambda x: func(x))

Tue May 12 20:31:24 2020
Wall time: 3h 35min 42s


import pickle
f = open('Imp_file_3','wb')
pickle.dump(df.script, file = f)
f.close()

%%time
import pickle
f = open('Imp_file_3','rb')
df.script = pickle.load(f)
f.close()

In [12]:
df.head()

Unnamed: 0,File_Name,Labels,script
0,file_2180.txt,8.0,"[Crouching, Tiger, ,, Hidden, Dragon, by, Wang..."
1,file_693.txt,4.0,"[``, MUMFORD, '', Screenplay, by, Lawrence, Ka..."
2,file_2469.txt,6.0,"[MAX, PAYNE, Written, by, Beau, Michael, Thorn..."
3,file_2542.txt,6.0,"[SLUMDOG, MILLIONAIRE, Written, by, Simon, Bea..."
4,file_378.txt,16.0,"[The, Abyss, -, by, James, Cameron, THE, ABYSS..."


In [13]:
df.Labels.value_counts()

6.0     405
19.0    261
4.0     243
0.0     203
5.0     141
15.0    134
1.0     116
16.0    109
11.0    104
8.0      79
14.0     75
7.0      27
2.0      25
20.0     18
13.0     15
21.0      9
12.0      4
9.0       3
3.0       2
10.0      2
17.0      2
18.0      1
Name: Labels, dtype: int64

In [14]:
a = [3.0,10.0,17.0,18.0]
d = df[df.Labels.apply(lambda x: True if x in a else False)].index
df[df.Labels.apply(lambda x: True if x in a else False)]

Unnamed: 0,File_Name,Labels,script
152,file_29.txt,3.0,"[WARM, SPRINGS, Written, by, Margaret, Nagle, ..."
986,file_1559.txt,10.0,"[THE, LAST, STATION, Written, by, Michael, Hof..."
1046,file_431.txt,17.0,"[QUANTUM, PROJECT, :, ORIGINAL, SCREENPLAY, IN..."
1272,file_432.txt,17.0,"[MY, MOTHER, DREAMS, THE, SATAN, DISCIPLES, IN..."
1337,file_1561.txt,10.0,"[THE, OTHER, BOLEYN, GIRL, Written, by, Peter,..."
1654,file_30.txt,3.0,"[``, THE, DOORS, '', Screenplay, by, Randall, ..."
1859,file_922.txt,18.0,"[SPEED, RACER, Written, by, Larry, &, Andy, Wa..."


In [15]:
d

Int64Index([152, 986, 1046, 1272, 1337, 1654, 1859], dtype='int64')

In [16]:
df.drop(index = d, inplace = True)
df.reset_index(drop = True, inplace = True)

df['author'] = df.script.str.extract(r'by ([a-zA-Z]*)')[0]
df.author.value_counts().count()

author = df.author.value_counts().index[:50]
df.author = df.author.apply(lambda x: x if x in author else 'other')

## 3. Remove StopOwrds

In [17]:
from nltk.corpus import stopwords
st_words = set(stopwords.words('english'))

In [18]:
len(df.script[0:1][0])

14079

In [19]:
%%time
def func(x):
    return([a.lower() for a in x if a.lower() not in st_words])

df.script = df.script.apply(lambda x: func(x))

Wall time: 36.9 s


# 4. Punctuation

In [20]:
punct  = {w for w in string.punctuation}

In [21]:
%%time
df.script = df.script.apply(lambda x: [w for w in x if w not in punct])

Wall time: 8.34 s


In [22]:
df.script = df.script.apply(lambda x: [w for w in x if w not in ['``','//','--',"''"]])

## 3.Checking wrong spelling

from nltk.corpus import wordnet as wn
all_words = {w for w in wn.words()}
len(all_words)

%%time
def func(x):
    return(' '.join([w for w in x if w in all_words]))

df.script = df.script.apply(lambda x: func(x))

In [23]:
%%time
all_words = set()
df.script.apply(lambda x: all_words.update(x))
print(len(all_words))

204199
Wall time: 1.76 s


In [24]:
import warnings
warnings.filterwarnings(action = 'ignore')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import *
from keras.models import Sequential
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint
from keras.optimizers import *

Using TensorFlow backend.


In [25]:
print(df.script.apply(lambda x: len(x)).max())
print(df.script.apply(lambda x: len(x)).min())
print(df.script.apply(lambda x: len(x)).mean())

28793
837
13210.282624113475


In [26]:
%%time
max_words = 80000
maxlen = 10000
token = Tokenizer(max_words)
token.fit_on_texts(all_words)

Wall time: 2.02 s


In [27]:
%%time
df.script = token.texts_to_sequences(df.script)
seq = pad_sequences(sequences= df.script, maxlen=maxlen, padding='post')

Wall time: 18 s


In [28]:
print(os.listdir())

['.ipynb_checkpoints', 'answer.xlsx', 'Classify_Movie.ipynb', 'GoogleNews-vectors-negative300.bin', 'GoogleNews-vectors-negative300.bin.gz', 'Imp_file', 'Imp_file_2', 'Imp_file_3', 'Movie_Scripts_Sample_Submission.xlsx', 'Scripts', 'Test.csv', 'Train.csv', 'wiki-news-300d-1M-subword.vec', 'wiki-news-300d-1M-subword.vec.zip']


In [29]:
%%time

vec = keyedvectors.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)

Wall time: 1min 20s


In [30]:
%%time
import tqdm
emb = dict()

for word in tqdm.tqdm(vec.wv.vocab):
    emb[word] = vec.word_vec(word)

gc.collect()

100%|████████████████████████████████████████████████████████████████████| 3000000/3000000 [00:05<00:00, 543668.42it/s]


Wall time: 6.5 s


9

%%time
import tqdm
emb = dict()
file = open('wiki-news-300d-1M-subword.vec','r', encoding='utf-8')
for f in tqdm.tqdm(file):
    f = f.rstrip().split(' ')
    emb[f[0]] = np.array(f[1:], dtype = 'float')
file.close()

In [31]:
%%time
emb_matrix = np.zeros(shape = (len(token.word_index), 300))
count = 0
for w, i in tqdm.tqdm(token.word_index.items()):
    i = i - 1
    val = emb.get(w)
    if val is not None :
        emb_matrix[i] = val
        count += 1
print(f"Total element found : {count}")
del emb

100%|██████████████████████████████████████████████████████████████████████| 122913/122913 [00:00<00:00, 307745.96it/s]

Total element found : 63982
Wall time: 809 ms





In [32]:
emb_matrix.shape

(122913, 300)

In [60]:
def makers():
    model = Sequential()
    model.add(Embedding(len(token.word_index), output_dim = 300, input_length= maxlen, trainable = False, weights = [emb_matrix]))
    model.add(SpatialDropout1D(rate = 0.4))
    model.add(Conv1D(filters = 16, kernel_size=3, strides = 2, padding = 'valid', activation = 'relu'))
    model.add(Conv1D(filters = 32, kernel_size=3, strides = 2, padding = 'valid', activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dropout(p = 0.5))
    
    model.add(Conv1D(filters = 64, kernel_size=4, strides = 2, padding = 'valid', activation = 'relu'))
    model.add(MaxPool1D(2,1))
    
    model.add(BatchNormalization())
    model.add(Dropout(p = 0.4))

    model.add(AveragePooling1D())
    model.add(GlobalAveragePooling1D())
    
    model.add(BatchNormalization())
    model.add(Dropout(p = 0.3))

    model.add(Dense(64, activation = 'relu'))
    model.add(BatchNormalization())
    model.add(Dropout(p = 0.2))
    
    model.add(Dense(18, activation = 'softmax'))
    
    model.compile(loss = 'categorical_crossentropy', metrics = ['accuracy'], optimizer = Adagrad(lr = 0.02))
    call = ModelCheckpoint(monitor='val_loss', save_best_only=True, filepath='best_1.hdf5', verbose=1)
    
    return(model, call)

In [61]:
model,_ = makers()
model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 10000, 300)        36873900  
_________________________________________________________________
spatial_dropout1d_10 (Spatia (None, 10000, 300)        0         
_________________________________________________________________
conv1d_28 (Conv1D)           (None, 4999, 16)          14416     
_________________________________________________________________
conv1d_29 (Conv1D)           (None, 2499, 32)          1568      
_________________________________________________________________
batch_normalization_37 (Batc (None, 2499, 32)          128       
_________________________________________________________________
dropout_37 (Dropout)         (None, 2499, 32)          0         
_________________________________________________________________
conv1d_30 (Conv1D)           (None, 1248, 64)        

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2820 entries, 0 to 2819
Data columns (total 3 columns):
File_Name    2820 non-null object
Labels       1971 non-null float64
script       2820 non-null object
dtypes: float64(1), object(2)
memory usage: 66.2+ KB


In [63]:
import random
import tensorflow as tf
ran = 8998
np.random.seed(ran)
random.seed(ran)
tf.random.set_random_seed(ran)

In [64]:
ind = [a for a in range(1971)]
random.shuffle(ind)

tr_seq = seq[:1971]
te_seq = seq[1971:]
label = pd.get_dummies(df.Labels[:1971].astype('category'))

tr_seq = tr_seq[ind]
label = label.loc[ind,:]

tr_seq.shape, te_seq.shape, label.shape

((1971, 10000), (849, 10000), (1971, 18))

In [65]:
%%time
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(random_state=8998, n_splits=3)
d = pd.DataFrame(np.reshape(tr_seq,(1971,maxlen)))
l = label.values

pred_0 = pd.DataFrame(np.zeros(shape = (test.shape[0], 18)))
pred_1 = pd.DataFrame(np.zeros(shape = (test.shape[0], 18)))
pred_2 = pd.DataFrame(np.zeros(shape = (test.shape[0], 18)))

for i, (tr_index, te_index) in enumerate(sss.split(tr_seq, label)):
    xtrain, xtest = d.loc[tr_index,:], d.loc[te_index,:]
    ytrain, ytest = l[tr_index], l[te_index]   
    model,call = makers()
    model.fit(xtrain, ytrain, epochs = 20, batch_size=128, callbacks=[call], shuffle=False, validation_data=[xtest, ytest])
    print('#'*40)
    model.load_weights('best_1.hdf5')
    
    if i ==0:
        pred_0 = pd.DataFrame(model.predict_proba(np.reshape(te_seq,(849,maxlen))))
    elif i ==1:
        pred_1 = pd.DataFrame(model.predict_proba(np.reshape(te_seq,(849,maxlen))))
    else :
        pred_2 = pd.DataFrame(model.predict_proba(np.reshape(te_seq,(849,maxlen))))
    os.remove('best_1.hdf5')

Train on 1773 samples, validate on 198 samples
Epoch 1/20

Epoch 00001: val_loss improved from inf to 2.65425, saving model to best_1.hdf5
Epoch 2/20

Epoch 00002: val_loss improved from 2.65425 to 2.56522, saving model to best_1.hdf5
Epoch 3/20

Epoch 00003: val_loss improved from 2.56522 to 2.50918, saving model to best_1.hdf5
Epoch 4/20

Epoch 00004: val_loss improved from 2.50918 to 2.48475, saving model to best_1.hdf5
Epoch 5/20

Epoch 00005: val_loss improved from 2.48475 to 2.47150, saving model to best_1.hdf5
Epoch 6/20

Epoch 00006: val_loss improved from 2.47150 to 2.46621, saving model to best_1.hdf5
Epoch 7/20

Epoch 00007: val_loss did not improve from 2.46621
Epoch 8/20

Epoch 00008: val_loss did not improve from 2.46621
Epoch 9/20

Epoch 00009: val_loss did not improve from 2.46621
Epoch 10/20

Epoch 00010: val_loss did not improve from 2.46621
Epoch 11/20

Epoch 00011: val_loss did not improve from 2.46621
Epoch 12/20

Epoch 00012: val_loss did not improve from 2.46621


In [66]:
pred_0.shape, pred_1.shape, pred_2.shape

((849, 18), (849, 18), (849, 18))

In [67]:
name = [name for name in label.columns]
print(a)
label.head()

[3.0, 10.0, 17.0, 18.0]


Unnamed: 0,0.0,1.0,2.0,4.0,5.0,6.0,7.0,8.0,9.0,11.0,12.0,13.0,14.0,15.0,16.0,19.0,20.0,21.0
1524,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1829,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
821,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
731,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1917,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [68]:
temp = pd.DataFrame(np.zeros(shape = (849,18)), columns = list(label.columns))

In [69]:
pred_0.columns = list(label.columns)
pred_1.columns = list(label.columns)
pred_2.columns = list(label.columns)

In [70]:
for c in list(label.columns) :
    temp.loc[:,c] = (pred_0.loc[:,c] + pred_1.loc[:,c] + pred_2.loc[:,c])/3 

In [71]:
temp.head(1)

Unnamed: 0,0.0,1.0,2.0,4.0,5.0,6.0,7.0,8.0,9.0,11.0,12.0,13.0,14.0,15.0,16.0,19.0,20.0,21.0
0,0.102208,0.05755,0.015463,0.132605,0.068318,0.192067,0.016707,0.03047,0.008943,0.047117,0.009655,0.015716,0.04297,0.064523,0.055001,0.122098,0.010403,0.008188


In [72]:
#temp1 = pd.DataFrame(pred_0, columns = [label.columns])
temp2 = pd.DataFrame(np.full(shape = (849,4), fill_value = 1e-7), columns = a)

temp = pd.concat([temp,temp2],axis =1)
#temp.columns = [x for x in label.columns]+a

temp = temp.reindex(sorted(temp.columns), axis = 1)
temp.head()

Unnamed: 0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0
0,0.102208,0.05755,0.015463,1e-07,0.132605,0.068318,0.192067,0.016707,0.03047,0.008943,...,0.009655,0.015716,0.04297,0.064523,0.055001,1e-07,1e-07,0.122098,0.010403,0.008188
1,0.102727,0.067118,0.016193,1e-07,0.105526,0.070758,0.181599,0.014854,0.034991,0.010102,...,0.009117,0.014875,0.044882,0.052217,0.063182,1e-07,1e-07,0.136721,0.013184,0.01068
2,0.116069,0.077552,0.018183,1e-07,0.100077,0.064733,0.162305,0.016218,0.038896,0.010066,...,0.010001,0.015043,0.041962,0.048678,0.066594,1e-07,1e-07,0.136924,0.012886,0.009817
3,0.110151,0.075363,0.015053,1e-07,0.145473,0.053138,0.165387,0.027289,0.037292,0.007241,...,0.008493,0.017644,0.03557,0.067081,0.053236,1e-07,1e-07,0.116939,0.007769,0.006463
4,0.117263,0.078553,0.01836,1e-07,0.096927,0.065885,0.163071,0.016001,0.038936,0.009926,...,0.009673,0.014448,0.042272,0.046606,0.068717,1e-07,1e-07,0.136998,0.013114,0.009682


In [73]:
d = temp

In [74]:
sample

Unnamed: 0,File_Name,0,1,2,3,4,5,6,7,8,...,12,13,14,15,16,17,18,19,20,21
0,file_2300.txt,0.254897,0.070870,0.034026,0.001088,0.076833,0.055085,0.108631,0.014625,0.040442,...,0.002126,0.007801,0.031488,0.047958,0.081665,0.001089,0.000549,0.103292,0.009705,0.004852
1,file_809.txt,0.072761,0.042652,0.011684,0.001046,0.128757,0.067255,0.234022,0.012940,0.034502,...,0.002092,0.007767,0.088627,0.070536,0.042840,0.001048,0.000524,0.114109,0.008936,0.004475
2,file_1383.txt,0.176885,0.055458,0.011611,0.000980,0.082053,0.046127,0.297048,0.011906,0.036915,...,0.001932,0.007315,0.030388,0.060440,0.041038,0.000975,0.000483,0.079213,0.009655,0.004332
3,file_983.txt,0.077208,0.040396,0.010318,0.000940,0.110868,0.069286,0.193717,0.011270,0.031018,...,0.001883,0.006810,0.031463,0.055402,0.040821,0.000947,0.000475,0.260130,0.008493,0.004113
4,file_1713.txt,0.108292,0.044440,0.010759,0.000963,0.081152,0.073162,0.174912,0.011675,0.032233,...,0.001917,0.007004,0.032439,0.050234,0.046141,0.000974,0.000484,0.264620,0.009009,0.004406
5,file_629.txt,0.074194,0.103399,0.012331,0.001050,0.138855,0.064591,0.223220,0.013726,0.035549,...,0.002067,0.007823,0.033660,0.069032,0.042212,0.001046,0.000523,0.106433,0.008985,0.004599
6,file_1213.txt,0.077656,0.040499,0.010044,0.000957,0.079574,0.060107,0.202646,0.011185,0.030428,...,0.001916,0.006881,0.035360,0.063224,0.043614,0.000978,0.000474,0.272704,0.008705,0.004181
7,file_2311.txt,0.060349,0.036126,0.010418,0.000927,0.190222,0.057193,0.294957,0.011681,0.030043,...,0.001810,0.006820,0.029462,0.074952,0.036173,0.000925,0.000459,0.099893,0.007935,0.003896
8,file_1004.txt,0.162405,0.063235,0.011864,0.001056,0.076578,0.066075,0.191650,0.013228,0.035822,...,0.002080,0.007673,0.032764,0.054917,0.072364,0.001057,0.000531,0.138742,0.010504,0.004873
9,file_1382.txt,0.177775,0.141331,0.014203,0.001109,0.082644,0.057721,0.110069,0.015034,0.091969,...,0.002196,0.008330,0.036050,0.053880,0.051163,0.001096,0.000557,0.091407,0.009518,0.004948


In [75]:
for c in sample.columns[1:]:
    sample.loc[:,c] = d.loc[:,c]

In [76]:
sample

Unnamed: 0,File_Name,0,1,2,3,4,5,6,7,8,...,12,13,14,15,16,17,18,19,20,21
0,file_2300.txt,0.102208,0.057550,0.015463,1.000000e-07,0.132605,0.068318,0.192067,0.016707,0.030470,...,0.009655,0.015716,0.042970,0.064523,0.055001,1.000000e-07,1.000000e-07,0.122098,0.010403,0.008188
1,file_809.txt,0.102727,0.067118,0.016193,1.000000e-07,0.105526,0.070758,0.181599,0.014854,0.034991,...,0.009117,0.014875,0.044882,0.052217,0.063182,1.000000e-07,1.000000e-07,0.136721,0.013184,0.010680
2,file_1383.txt,0.116069,0.077552,0.018183,1.000000e-07,0.100077,0.064733,0.162305,0.016218,0.038896,...,0.010001,0.015043,0.041962,0.048678,0.066594,1.000000e-07,1.000000e-07,0.136924,0.012886,0.009817
3,file_983.txt,0.110151,0.075363,0.015053,1.000000e-07,0.145473,0.053138,0.165387,0.027289,0.037292,...,0.008493,0.017644,0.035570,0.067081,0.053236,1.000000e-07,1.000000e-07,0.116939,0.007769,0.006463
4,file_1713.txt,0.117263,0.078553,0.018360,1.000000e-07,0.096927,0.065885,0.163071,0.016001,0.038936,...,0.009673,0.014448,0.042272,0.046606,0.068717,1.000000e-07,1.000000e-07,0.136998,0.013114,0.009682
5,file_629.txt,0.101409,0.057754,0.015123,1.000000e-07,0.111143,0.078007,0.193379,0.013790,0.031932,...,0.008599,0.013703,0.044270,0.057254,0.060021,1.000000e-07,1.000000e-07,0.135380,0.012694,0.010216
6,file_1213.txt,0.104371,0.065762,0.016266,1.000000e-07,0.105015,0.073206,0.179934,0.014690,0.034824,...,0.009113,0.014151,0.043657,0.052158,0.063962,1.000000e-07,1.000000e-07,0.138091,0.013336,0.010361
7,file_2311.txt,0.106431,0.064723,0.016861,1.000000e-07,0.107260,0.071960,0.184577,0.014487,0.034533,...,0.009347,0.015019,0.043039,0.053101,0.063192,1.000000e-07,1.000000e-07,0.134774,0.012209,0.009886
8,file_1004.txt,0.103280,0.067036,0.016348,1.000000e-07,0.103604,0.071961,0.178674,0.014969,0.035265,...,0.009178,0.014279,0.045706,0.051911,0.063751,1.000000e-07,1.000000e-07,0.137209,0.013970,0.010768
9,file_1382.txt,0.105389,0.065988,0.017214,1.000000e-07,0.110903,0.071295,0.179280,0.015368,0.034985,...,0.009789,0.015576,0.042549,0.053380,0.062133,1.000000e-07,1.000000e-07,0.134326,0.012339,0.009924


In [77]:
sample.to_excel('answer.xlsx', index = False)