<a href="https://colab.research.google.com/github/lkarjun/fastai-workouts/blob/master/Lesson_11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Packages

In [1]:
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [2]:
from fastai.text.all import *

# Data Munging

In [None]:
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')

In [3]:
path = untar_data(URLs.IMDB)

dls = DataBlock(
    blocks = (TextBlock.from_folder(path), CategoryBlock()),
    get_y = parent_label,
    get_items = partial(get_text_files, folders = ['train', 'test']),
    splitter = GrandparentSplitter(valid_name = 'test')
).dataloaders(path)

## Transforms

In [6]:
files = get_text_files(path, folders=['train', 'test'])
txts = L(o.open().read() for o in files[:2000])

In [7]:
tok = Tokenizer.from_folder(path)
tok.setup(txts)
toks = txts.map(tok)
toks[0]

(#278) ['xxbos','xxmaj','despite','having','a','very','pretty','leading','lady','('...]

In [8]:
num = Numericalize()
num.setup(toks)
nums = toks.map(num)
nums[0][:5]

TensorText([  2,   8, 529, 269,  12])

In [9]:
nums_dec = num.decode(nums[0][:10])
nums_dec

(#10) ['xxbos','xxmaj','despite','having','a','very','pretty','leading','lady','(']
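
The numericalization step above can be pictured as a vocabulary lookup. The block below is a minimal plain-Python sketch of that idea, not fastai's `Numericalize` implementation: the helpers `build_vocab`, `encode`, and `decode`, the toy token lists, and the choice of three special tokens are all assumptions made for illustration. The real class also applies frequency and vocabulary-size limits.

```python
from collections import Counter

def build_vocab(token_lists, specials=("xxunk", "xxpad", "xxbos")):
    """Special tokens first, then the remaining tokens from most to least frequent."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = list(specials)
    vocab += [tok for tok, _ in counts.most_common() if tok not in specials]
    return vocab

def encode(tokens, stoi):
    # Unknown tokens fall back to index 0 (the "xxunk" slot in this sketch).
    return [stoi.get(tok, 0) for tok in tokens]

def decode(ids, vocab):
    return [vocab[i] for i in ids]

# Toy data standing in for the tokenized reviews (hypothetical).
toy_toks = [["xxbos", "a", "very", "pretty", "movie"],
            ["xxbos", "a", "very", "bad", "movie"]]
vocab = build_vocab(toy_toks)
stoi = {tok: i for i, tok in enumerate(vocab)}

ids = encode(["xxbos", "a", "pretty", "film"], stoi)
print(ids)
print(decode(ids, vocab))
```

Decoding returns tokens rather than the original string, and any token missing from the vocabulary comes back as `xxunk`, so the round trip is lossy for unseen words.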

In [10]:
tok.decode(toks[0][:10])

'xxbos xxmaj despite having a very pretty leading lady ('

## Writing Your Own Transform

In [12]:
def f(x: int): return x+1

tfm = Transform(f)
tfm(2), tfm(2.0)

(3, 2.0)

In [None]:
Transform??

In [14]:
@Transform
def f(x: int): return x+1  # Applied only when the input is an int; other types pass through unchanged

f(2), f(2.0)

(3, 2.0)
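
The output `(3, 2.0)` shows the function being applied to the `int` but not to the `float`. One way to picture that behaviour is a wrapper that inspects the type annotation of the first parameter. This is only a sketch of the idea in plain Python; `TypedTransform` and `add_one` are hypothetical names, and fastai's `Transform` uses its own type-dispatch machinery rather than this code.

```python
import inspect

class TypedTransform:
    """Hypothetical stand-in: apply `func` only to inputs matching its first annotation."""
    def __init__(self, func):
        self.func = func
        params = list(inspect.signature(func).parameters.values())
        ann = params[0].annotation if params else inspect.Parameter.empty
        # A missing annotation means "accept anything".
        self.accepts = object if ann is inspect.Parameter.empty else ann

    def __call__(self, x):
        # Inputs of any other type are returned untouched.
        return self.func(x) if isinstance(x, self.accepts) else x

@TypedTransform
def add_one(x: int): return x + 1

print(add_one(2), add_one(2.0))
```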

In [21]:
class NormalizeMean(Transform):
  def setups(self, items): self.mean = sum(items)/len(items)
  def encodes(self, x): return x - self.mean
  def decodes(self, x): return x + self.mean

In [22]:
tfm = NormalizeMean()
tfm.setup([1,2,3,4,5])

In [23]:
print("Mean is: ", tfm.mean)
enco = tfm(2)
print("Encode of 2 is : ", enco)
print("Decode of enco is: ", tfm.decode(enco))

Mean is:  3.0
Encode of 2 is :  -1.0
Decode of enco is:  2.0


## Pipeline

In [38]:
tfms = Pipeline([tok, num]) # Behaves like a Transform except for setup; tok and num were already set up above

In [39]:
t = tfms(txts[0])
t[:10]

TensorText([  2,   8, 529, 269,  12,  87, 197, 995, 750,  38])

In [40]:
tfms.decode(t)[:30]

'xxbos xxmaj despite having a v'
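
The cells above call the transforms in order and then decode the result back to text. A rough plain-Python sketch of that composition follows, with encoding applied first to last and decoding applied last to first. The classes `SplitWords`, `ToIds`, and `MiniPipeline` and the five-word vocabulary are hypothetical and are not part of fastai.

```python
class SplitWords:
    def encode(self, s): return s.split(" ")
    def decode(self, toks): return " ".join(toks)

class ToIds:
    def __init__(self, vocab):
        self.vocab = vocab
        self.stoi = {w: i for i, w in enumerate(vocab)}
    def encode(self, toks): return [self.stoi[w] for w in toks]
    def decode(self, ids): return [self.vocab[i] for i in ids]

class MiniPipeline:
    def __init__(self, steps): self.steps = steps

    def __call__(self, x):
        # Encode with each step in the order given.
        for step in self.steps: x = step.encode(x)
        return x

    def decode(self, x):
        # Undo the steps in reverse order.
        for step in reversed(self.steps): x = step.decode(x)
        return x

pipe = MiniPipeline([SplitWords(), ToIds(["despite", "having", "a", "pretty", "lady"])])
ids = pipe("despite having a pretty lady")
print(ids)
print(pipe.decode(ids))
```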

## TfmdLists and Datasets

### TfmdLists

In [41]:
files = get_text_files(path, folders=['train', 'test'])

In [43]:
# On initialization, TfmdLists automatically calls the setup method
# of each Transform in order
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])

In [47]:
t = tls[0]
t[:20]

TensorText([    2,     8,   486,   282,    13,    71,   207,  1001,   834,    37,     0,     8, 23923,    11,    44,    14,    79,   418,    24, 16245])

In [48]:
tls.decode(t)[:40]

'xxbos xxmaj despite having a very pretty'

In [51]:
tls.show(t)

xxbos xxmaj despite having a very pretty leading lady ( xxunk xxmaj arenas , one of my boy - crushes ) , the acting and the direction are examples of what xxup not to do while making a movie . 

 xxmaj placed in southern xxmaj mexico , xxmaj xxunk , the xxmaj aztec xxmaj mummy ( real xxmaj aztecs , by the way , xxup did not made mummies ) has been waken up by the lead characters and starts making trouble in xxmaj mexico xxmaj city suburbia , during the first movie ( the xxmaj aztec xxmaj mummy ) . xxmaj in this second part , the leading man and woman want to find th mummy and put it in its final resting place ( a fireplace would have been my first choice … ) 

 xxmaj into this appears xxmaj the xxmaj bat , a criminal master - mindless stereotype of a criminal genius who creates a " human robot " ( some idiot inside a robot xxup suit ) to control xxmaj xxunk and ( get this ) take over the world . xxmaj the final match between the robot and the mummy is hilarious , some of the worst chor

In [68]:
cut = int(len(files) * 0.8)
# Index lists for the training (first 80%) and validation (last 20%) sets
splits = [list(range(cut)), list(range(cut, len(files)))]
np.random.shuffle(splits[0])
np.random.shuffle(splits[1])

In [69]:
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize],
                splits = splits)

In [70]:
tls.decode(tls.train[0][:10])

'xxbos xxup star xxup rating : xxrep 5 * xxmaj'
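
The `splits` argument used above is a pair of index lists, one for training and one for validation. The following standard-library sketch builds such a pair with an 80/20 cut; `make_splits`, its parameters, and the fixed seed are assumptions for illustration, and the notebook itself uses `np.random.shuffle` instead.

```python
import random

def make_splits(n_items, train_frac=0.8, seed=0):
    """Return [train_idxs, valid_idxs]: cut at train_frac, then shuffle each side."""
    cut = int(n_items * train_frac)
    train, valid = list(range(cut)), list(range(cut, n_items))
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rng.shuffle(train)
    rng.shuffle(valid)
    return [train, valid]

splits = make_splits(10)
print(len(splits[0]), len(splits[1]))
```

Because the cut happens before shuffling, which items land in the validation set is decided by the original item order; shuffling only changes the order within each split.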

In [71]:
lbls = files.map(parent_label)
lbls

(#50000) ['neg','neg','neg','neg','neg','neg','neg','neg','neg','neg'...]

In [72]:
cat = Categorize()
cat.setup(lbls)

In [74]:
cat.vocab

['neg', 'pos']

In [79]:
cat(lbls[np.random.randint(len(lbls))])  # encode a randomly chosen label

TensorCategory(0)

In [83]:
tls_y = TfmdLists(files, [parent_label, Categorize])
tls_y[0]

TensorCategory(0)
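
The label pipeline `[parent_label, Categorize]` first takes the name of each file's parent folder and then maps that name to an integer. Here is a small plain-Python sketch of those two steps, assuming a vocabulary built from the sorted unique labels (consistent with the `['neg', 'pos']` vocab shown earlier). The `parent_name` helper and the file paths are made up for illustration and this is not fastai's implementation.

```python
from pathlib import PurePosixPath

def parent_name(p):
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(p).parent.name

# Made-up paths that mimic a train/test, neg/pos folder layout.
paths = ["imdb/train/neg/0_3.txt", "imdb/train/pos/1_9.txt", "imdb/test/neg/5_2.txt"]

labels = [parent_name(p) for p in paths]
vocab = sorted(set(labels))                  # category vocabulary
o2i = {lbl: i for i, lbl in enumerate(vocab)}
targets = [o2i[lbl] for lbl in labels]       # integer-encoded labels
print(vocab, targets)
```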

### Datasets