# Penn Treebank

The Penn Treebank contains a million-word corpus of Wall Street Journal articles. This corpus can be used for either character-level or word-level modeling (the tasks of predicting the next character or word in a sentence given those preceding). The quality of a trained model is measured by its perplexity on the corpus (more on this metric later).
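
As a rough preview of the metric, perplexity is commonly defined as the exponential of the average negative log-probability that a model assigns to each true token. The following is a minimal sketch under that definition; the function name `perplexity` and the probability values are made up for illustration and do not come from a trained model.

```python
import math


def perplexity(token_probs):
    """Perplexity given the probability a model assigned to each true token."""
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating the average gives the perplexity.
    return math.exp(avg_nll)


# Hypothetical probabilities for four successive tokens.
confident = [0.5, 0.5, 0.5, 0.5]
unsure = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))
print(perplexity(unsure))
```

A model that assigns higher probability to the tokens that actually occur gets a lower perplexity, so lower is better.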

The Penn Treebank corpus consists of sentences. How can we transform sentences into a form that can be fed to machine learning systems such as recurrent language models? Recall that machine learning models accept tensors (with recurrent models accepting sequences of tensors) as input. Consequently, we need to transform words into tensors for machine learning.

The simplest method of transforming words into vectors is to use “one-hot” encoding. Suppose that our language dataset uses a vocabulary of $|V|$ words. Then each word is transformed into a vector of length $|V|$. All the entries of this vector are zero, except for a single entry, set to one, at the index that corresponds to the current word. For an example of this encoding, see Figure 7-10.

> **Figure 7-10: _One-hot encodings transform words into vectors with only one nonzero entry (which is typically set to one). Different indices in the vector uniquely represent words in a language corpus._**<br><img width="500" src="data:image/png;base64, iVBORw0KGgoAAAANSUhEUgAAA1YAAAGmCAIAAADwM4IMAAAACXBIWXMAAAsSAAALEgHS3X78AAAgAElEQVR42uzdTWwT59o38JsCLQHXtVQ9MlIt4cpnUy8Y1C6wjJBpHmWoxMIHJ9KrSCVWvIF0gTk58TIJTpbOSTGLE9g4slMpG2OOFyw60RNhISKzKMqwMJuO6kiuhBeVXNdI5xygvIuL3Exm7LHjzzj5/4RQMI49X8n8fd1fh96+fcsAAAAA4CD5AIcAAAAA4KA50t6XQ00RAAAAoO0OHTrU5hdsOrQh7QEAAAD0aTTcdQRE8gMAAADo9yzYaASs9TQkQgAAAIA9kvkaD4L1I6D+CZpH+D8RBwEAAAC6EPs0Uc/4n81EQPX/ar6u9V/G8REAAAAAdhX4aj1IXzeYC3cRAfUhr+rfqAICAAAAdC0RHjp0SBP7+CONp8BDjRTw1Gmv1tesdgMxAAAAALQS+2qlvUPbGGMffPBBre9qNALq23x5yPvzzz/5F/Qgf4TpGoiRBQEAAACajn38kaqxj6Pwp46DtVqHuSMN5j81Cn/0t/pBtrONGOEPAAAAoMUsWDX8ffDBB/xvfRPwoUOH3r59S4/wL+pHQH3+i8fjf//733///XecDAAAAICecLvdy8vLFouF5z/u7du3/EF1fDTKl8YzvFCp73/+53+Q/wAAAAB66x//+Mf/+3//74OdDh8+zL9WFwiZYXNw9SqgZswv8h8AAABAz21tbf3nP/85fPgwxT76gqkGY1BFkDH2559/0te76AvIU6B6IAgAAAAA9Nbr16//+9//Ht72559/UoMtBUHNkGF1/tNnwSOa2Kf5Jx/2yy1OT5xxOnAOAAAAALrgb/NLck7hEZCqgEeOHKG/eU7jI0VohC5vCN7FcBDNqF56Ie6M0+FxCTgfAAAAAF1gMZv4169fv/73v/995MiRN2/eUP5TR0AeBOlxg/zH6vYFREMwAAAAwB5BVcA3b95QEzAPf3wUCO8CyFOgOt2p/3lEE/v413y2P00VEAAAAAB6FQH//e9/Hz16VN1VTz1BDCU3gy6AVSKgHqqAAAAAAHsqAv7nP//h+Y/C3+HDh9+8eaMeHaLJb1WD4BF97Kv1BQAAAAD0PAKq8x+fGoZSIDXh0lgQ4+6ARsNBqo4IBgAAAICe+PPPP//73/8yVf2PxgW/efOGdxBscF6/hqaGxhEHAAAA6Lk3b968evWK57/Xr1/z/Mcrd5oU10xfQLY9NSCOOAAAAMDeiYCU/16/fl23/lcrBX5QNfYx1UojAAAAALAX/Pnnnzz2cVVHgdQNch+oY1/VOIggCAAAALB3IqC6+MfbfxvpBah+/APjd0L+AwAAANg7EZCSHy/+VU1+jeS3D+o+AykQAAAAYC/gy3ZUDX+7mtH5AxxNAAAAgD6KgOrkV3UuaNZACe8D47fBsQYAAADYa0FQM/OLRiNZ7oO64Q9BEAAAAGBP5T9NCmwCGoIBAAAA+iwCspanbcGIYAAAAID9FhNbjYAAAAAA0O+BDxEQAAAAoI/TnmYgSNMvhQgIAAAAsB/SISIgAAAAACACAgAAAAAiIAAAAAAiIAAAAAAgAgIAAAAAIiAAAAAAIAICAAAAQL86gkMA0B2ZrExfeFxCX+/I4OiU5pH11QWc370vnpTi9yT1I/5h0T8iqh9JSxvyc0X4wuEV3ThiAIiAANBS8osup9LShvpBr+gOjvv6NAvyLAv9ZevX
oubcXdh5BX556ZqcU96lwxExFgnhoAHsY2gIBuiUUrkSCEUGR6c0+Y8xlpY2Bken5qIrOEqwR0RjKZ7/GGPxpISsD4AICADNGBydiiclgyeEbyUm55dwoFqhTi3Qit//eKl5JPPkGQ4LwD6GhmCAjpicX2oknURjKc9ZAf2udiWTlTNPnm3mfs5kn5XKlTe/rOGYdMInH5/AQQBABASA3WWUaCzVcFj8JyJg4w5/PoSD0Ames6fV/7SYTZqRIgCwz6AhGKD9wrpOfoLToTz64c0va4vTE5r/yheKxu3FAN2IgC4hdTdst1np6/XVBYvZhMMCsI+hCgjQZnJO0fSjt5hN/IYaDPjk54om86XXHqPiAj3nFd0oSAMcHKgCArSZZuo1xph/RFQXVILjPs0T0tJGqVzBoQMAgK5BFRCgzdLSY80j3qEdlRXB6bDbrPlCUf1gJvuMF2A0RUTB6eAJMpOVN3PK73+8FL5weFynG2yq49/FGPOcPd25+QjlnPIwK+9q80rlSib7TH6uMMYMvkvOKbVSsvpwnbKdpKbMBjV4ZDTvrn4XOads5pStX4unPrN6XELj786/kTF26jPrGadDcDpqbSQNzt3VW+QLxUxWpg3ziu665yJfKG4VXqgfUe+m5n8tZhPf2nyhKOcU+bnyyccnLriEWntR9aSr98jgIAMAIiDAXpcvFDXZjlVbDkRwOjRPk58rPAJqlt9YX13wuIR4UpqLJjTfNXtjbCZ4xWB75qIr0VhKE54sZtNM8Eow4GvjjmeyciAUUW+exWxanJ4waODOF4pz0YS+H6RXdC9Of6e5/f9tfqnWNHXqw1X3gDR3ZDTvTu+Sycrh6Ipmq7yiOxYJGeStUrlye/l+PPmj/jqhE615pn4jBafj++kJgxxfKldo794/FGJ153lO3JPCtxK1ri7N/1JnwapnUHA6liOhWkGwyrZtv5Hv6qz6mDR+KgEAERCg9/QTwVS9F55x/kUzX/Rm7mf1/VUdLDZzSvyeVHXISPhWIl94UfXuXipXBkenqk5MUypXJueX5OdKu5Z/iCelQCiif5dAKGIxm6p2L0tLG4FQpGphLy1tZLLP1lcXGqknNaGJI3NBe0Z+rrrLtPGDhalaYynknKIJOmqnbCfVzxwPRapupJxTBkenYpFQ1Xhda+8CoUgr1V/NeGE5p9Bm6M8gPV719NXatvCtxCcfn9AcllOfoQQI0FnoCwjQ1gj4XHt7qxoF9DOulcova73m7eWUwZDhWqs41Eo56m9sfOYaYwYTXE/O/1P/YCYr+67OGnR/NEhprWv9yMg5pWr+4/97e/l+rehWK/8xxnjhs5HdD4QiVa+KuehKrW9s42ofpXLF4AyWypW/VbskAjVCbdVLCK3AAIiAAP3tTPUqoEOfD2q9gkFoeBdZdANQ9DnAbrPqi0Bz0ZW2DEMxeJF8oaipd1J1sJHXHG/gabvVliNT/4wkf6waag1eU70Nteqj+tik61EqtyvW12V8EDJZWfOEtLShXykRABABAfaJh7pCS4MjNure8i1mUzDgCwZ8+hfU3FnzhaKmU5d/RFQe/bC+uqBp3yyVK+2aktBus87eGKvaNJl5suOY3F6+rwkHHpfwm3xfefSDpvAj5xS+a+urC29+Wau6EAg9Tn+Me4+1/cj4R8TZG2P6ehWNkNDENf0ppoNGfy5sR8BMVtacULvN+vTBnd/k+5om9VK5cns5ZfxhgDay6mXTFh6XMHtjrGoTs2ZcVDiaqLVt+L0B0BPoCwjQB2hmQepc5R1ya8aLlMqVUrnC7/GJnTmAhmXwO2567bE6YWSeyK3fg2lwwPsgsjM8be4MQ/oKWeruTYvZZDGbFqe/812d3ZlpfmzjTHXtPTK8N9718cuO81c0CS9fKKrHzOoDpX9ErNoXUx/jFqe/o5eKRUKfSpd3HkxJPdm4/l0WpydoL/zD4peXrrX3suS7MBO8Mjg6pWlopsHOtTIxY+zpgzu0X8IXjkAH
Kr4AYAxVQIA+4BXdPE94XIK+oqO+v2oylmZCEM/ZHQWbTPZZ65unnulQP+uhetvknKIpAao3Tz93SXubDtt4ZOw2Ky95Wswmj+u0dq9VvUI1tTo6ibXG4mh2WT2eRj+2plSu8OCl7+pnt1l5ihWcjrZP+zwTHHt/MIe0L66O/lVmSlJd0pqJMwGgO1AFBOiHCDh0Tv1Pwemo1bVfPyuN8MWOfoeabojq2lU8KVVtSWSM+YdFg+ld1Nmi6jhQ/rW+ofyM8y/GuybnlLYMDW7lyFTb5XOavTBIq/q91gdlvrP6KWCM32gzp1A7LM0d2PRG7hZNb1nr6GloOgMwxvzDFxu8pAEAERDg4LKYTzT4TM3svu9uwDunmNH/L2WIrV+LtW7DF9o0mzRNwmycwzTboG5RbUUrR6baGdlF1UrfBlqrIKfPnZpopTlcTNXeqv/ejs6rsqsjoB/wjgG/AIiAAPuKZgK5xu+I7VqxQ59jDGZs6b5qw2VOGAcL9aTZ/XhkGpwq8l3o1FXyNAdE/2GA75d+B890ZmLFJuh/KIQ9s20ABxb6AgJ0O/SwatMHtou+zAa9PTL64hz6vQHAXoAqIEA76dvpGrQXCjaes6dnb4zV+i+cXAAAREAAqE7fw6lqu7C+NNjD5bD4umQel9Cu9uj9Qb1iWxu1ZTpuAABEQIA9RHA6LGaT5h6vH9Oq7x/WrvEW+jLk+urC3gl2+r6S+ULRs+Of2kEb7SpA9urI6N+Chv1WbQ7W76zmgOjX5OBXjn4kDR8svBd+LjTXfLsGegNA09AXEKDtt3ztXTy9tqFPAOpHLGZTu26HBsMF9gL94sjqCYSrRpx2leJ6eGTqLuhi8EzNAdEcLqaqH+sLyZ3rctr6EdhTlyUAIiAAtIFmDj/GWDSWUme+qG6i4DbO2auv+ujnJe4hfbHzX6pJg9UTHfPoUHf2kLor9vb8yOg/FcxFE1Wbg6mKrH4kk5XVz/yXbo5l3otU3500LW2ov3cz9/PeOe+aGSgxKSAAIiBA39MvdVAqVwZHpyipRGOpKmuFDYvtzKA7A2W+UNw7q29p5hNmjMk5hTcR6mtj+nCsT4Rz1Raf3VNHRv+pgN66agrU7zK/YPTLrNltVoNlY0rlCp/4Rr3gci8+F2l3KpOV+X5hdTiAnkBfQID2CwZ84VsJTdBxnP+26pPbPggjOO7T3OzjSSmTlf0jF6mrWan8Un6uPMzKF1zCTPBK1yPyRc3BGRydooVu9RP1qZcgI6dsJzVlv3hS2swpFrPpjNOhXjB37xwZ/4g4F01oNjstbXwqXaZTXypX/iqeo3f0D4uaDwmT80u///FS+MIR1oVdzfHxim7N9/KD09syG0V/zREIhCJUEZfRKAyACAiwP8wEr/xLetzgje17w9TSBI9L8IpuTdbJF4qa4MU3tcsH5/r4ZU3LeKlcqVoHCgZ8+pqfd8itTzMNHuoeHpnF6e98V2f1j/N94XtKHwk0+1h1CwWnQ7NqX3Dcp68x75GANRMc059lhD+AHkJDMEBHLEdCjcwAHIuEOjEussGX7ckN2GI2ra8u1H2a4HRUDWFe8VwrUyv36sh4RXcw4DN4gnrBmNTdm3V7QFrMpuVISH/QDN6ljaOOmuAfEQ2q3ZguGwAREGCfEJyO9dUFgxu5xWyKRUKaKk57Y1bd9uVSudLgWIpOHByDu77HJdR6gt1mXWyhbtrDI7M4PWGw5eqyqMVs+r/VfxjENbvNur66UPUJM8ErVR+3mE2NJMuOSt29WXXb/CMiJogBQAQE2Fcp8KcHd2ZvjGnuuxazyT8i/vTgTofynzrrrK8ueEW3PktZzCaPS1icntBPldIdHpegPFrRHxyPS4hFQsYB0T8iVo1xW7o5BffakQkGfMqjH/QDhqqGvKcP7uhrlpSAf3pwp1Zmor2bvTGmfguK1B6XcMb5F/WTH3a3g6DFbKKd4ufOK7pTd8MxXTkTALrg0Nu3bxlj9Dd98ee2169fv3r1
6tWrV1br+9/Re2qaWYB+kS8UKaCcsp3sSSWmVK5Q42ZvWwMNtq2JDeM71cqB7dWR4ZcEqzZhjQZ1Ddzt7176rl5dcrvy5aVrmsb3N7+s4fcGABkcneL9g7/++utvvvlmYGDg2LFjx48fHxgYOHHixMDAwPHjx48dOzYwMPDRRx99+OGHR48ePXr06JEjRz7YdujQoXfJb/sLDAcB6Aa7zdrb2zAVt/bmwWl629qyU706Mru6JJrbwn75uK6O8gDQNWgIBgCAXrq9fL9PwysAIiAAAEAdmaysn546Gkvpp7zRTyUNAG2HhmAAAOiG+D2Jpi3k6+Dp18tm2+OlcLgAEAEBAGA/4P3ZjXv+zQSvYJpAgC5AQzAAAHRcvlBsZKrFYMBnPIc2ALQLqoAAANBxdRcpttusi9PfeUX0AgRABAQAgP2CJsTOPJE3c4q6C6DHJdhtVs9ZAf3/ABABAQBgH/KKbhT5APYO9AUEAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAAAgR3AIAKA78oXiVuEFY+yU7aTdZu3mW8s5pVSuMMYEp8NiNu2RA1IqV+ScwhizmE2C03FgtwEAEAEBoLq0tJFee5wvFL+fnujf+3TinhS+lWCMzd4Ymwle6eZb/21+KZOVGWPrqwsel7BHDoicUwZHpxhjHpewvrpwYLehLQZHpyzmE96hc17RvXdSPgAiIAA0LxCKxJMSfU2lLIB9+TknupwKjvu8oruJb6eIn5Y2hGXH+uoCUiBAXegLCLCnxZMS5b/ZG2PrqwtoqoN9qVSu+K7OZrKy7+psvlBs4hXWVxdmb4wxxuScMjm/hEMKUBeqgAB7OwLekxhjHpfQ5ZZTgG7aUsW+38sVxnbdVdTjEqiJP3wrEU9KsUgIRxXAGKqAAHsaNW9d2DPd1wA6QXA6ggEfYywY8LVS6vacPa3+wQEAA6gCAgBA7y1OTyxOT+A4AHQNqoAAexp6tQPgBwegE1AFBHhnLrpCX+h73UVjqd//eFn1vzpNcDqqNmnFk9LWr8VTn1n9IyJtYXptwz8s0j/5c9Jrj0vll3ab1T8s1poMRf00z1lB/QpcJivH70n5QpHm3aj6nN3i73vBJVwfv6y5Z2eycubJM8aY5+xp9ZbXelyzkbQjtXJAPClRJ8uqb70rDb5pJiun1zY2cwpj7IzTcX3cp58ZMV8ozkUT9FLBcZ/B5DUNntkOnbhab2Swg/lC8fZyiv7XO+SuepSq/gDSg3Su+YsY7AsfSoKBUwCIgACNepiVKWxpskWpXKEBhj0ck5EvvNCGgHtSJit7XIJXdA+OTtHsvrzLYKlc4Q8yxjKMxZOSf0TU9JGv+jTGmObmOjm/FI2l+D/T0kbmidxid3v1TDeZrPwv6bFmIo/Mk2d8EsEdEbDG4+oXpI2Unyv6jSyVX3556dr7Xa721s3tRa03LZUrvqs31Tk+k5WFLxz2nQc5npQCoYj6pWKRkD7lNHhma21edDnV9glT6u6gZtcyWXkuuqIf3k6nVRMB+bnOF4qa41P1Itz6tYhfZQANQkMwwDs8P21u318Jv932ZEzGGadDXdvQ333noivyzg1WpwT/iDh7Y4zutfGkxAstxHf1Jj3NK7pnb4xR4NDMyhaNpSj/eVyC+qXUoXC34skf40nJbrPyDCfnFPUNfrfmoiuUdQSnY/bGWDDgs5hN3qFz+mdOzv9TzimC06F+6+bmEGnkTelcUDyymE00apW+2BFcCi8CoQg9zotngVBEc94bP7Pq/Ecnjk6rnFN8
V2+2N/8Z7yDPf3ablY4S/67GJ7nczP1ML+JxCTw48vmS9DwYPgXQAFQBAd7xDrmp5JB5ItONiped+BO6v1XGBRs5p1AgCAZ83iH3KdtJxtjt5fv0IF+E4/r4Zcf5K6VyJXwrwds95ZxCd251DWlxekL9jvlCkeKRV3Sn7obppb66dI2a5NRHaVfyhSJ/wUxWptUp0tIGhbMmXpDyqN1mffrgDq8kVT10+UKRV9f4W8eTkmbH2/WmPKALzvfzFZfKFc3T8oWi+gk8VN1eTqlHSDR4ZmnXeP7ja37Qy2ayclraaG765ao52GAHeQXdYjb99OAOPWgxm8K3EvTppcHxH2lpw2I28cIhL0tHl1Oda9oG2PdQBQR4h68eqymqPdyucBikk8OfDzX3py1TV8QiocXpCV5AorujxWziDWrqr3nhJL22QV/4h8VaifP2cmo7Gn7Hn3B93EeppemNt5hNPHR6XO97H1L/vN3KZGWqJ/lHLtaNzl7Rzd9OXVLSnPS2vGm+UOTnQt38WnXbliMh/nhw3Ldd/VL0obPumaVsRF98r8pY/Ov4vR/b8iNTdwfjSYmOkjoc869r1fCqUk8Ww3d5t2cNABABAarzuE7TjU3dAEe3Gfqv7qNegAZt0OoIpY4mmjKPVzynSX5cqfyydvXlMYVjddd+vjG8PrpbmlVceeJ52FogbqRhUdNQ+9ftw7LZbJgweFMeoGuVJDm7zar+gMHPnTri7OrMpqUN/cvy80j/27q6O8g3iW+kegdL5UrjnyLGdn5Q4U29mlegeQG3dH1nAUAPDcEA6vuHQHdHOafQzVLOKXTf9Zw16l305pe1tm9MqVyh3k4Ws0l9/9Nlmh2BgMcyu+2kJmRobpnCF+/CQXQ5VbVZkEfhMzvLnzxV5Ju90Wq2relSnObb40mpkbBV9XEa8d3eN+WJVhOA9E7tPCBV82XjZ5Z/oX/ZU7aTdE5pLFGLl2jdHeRbojnsfBcyT541uBm1Tpz+E5HgdFDX0pngWIPfBXAwoQoI8J6qvvXu1rXZo7EgpXLlU+Hy5PwS9eUyuJNp8hkPDXyZBPXdUf0cr+iml81k5UAooq9mqUsp1IeM/+EZsW3he+e27YrFbKI6KA0yaONWtfimlGgtZlNbgkjjZ5aXdfUXrWrM+MvWN8l4B99vsG4z+C40cbrrWl9d8I+I8aTkOP8t1ggBQAQEaAjvDsiTn/z83U2uy9OMbRWKdOOked0a/8ZGGjR5sS11N8x7ZamnGtG8FP2v+k/ndry5QuDi9ASdIDmnfHXpWrsaOlt507ZPUNf4maWLts4zn7fai67uDjZyKjc70JkvLW3QueA/zgCACAhQH/X548UDaurqfkdAGl/59MEdi9kUjaU6VMxQj+KUc4omBe62bbSH1MNFS+WK7+rsroYadOJN9313tL25gzQGuVSuxCKhpw/uYIJoAERAgIYj4Nn308Xxv407AnY0CFIvPf0Yjja+xU/bd8pas7XN3hh788ua/g+fbWTvpEA+MiYQinShEbAnbwoGMtlnpXJFcDowWQxAXRgOArCDVzxHM5k93B6AyRrozt902+j3242JtVDH+cbby+w2a6bec07pxhOsry7QAg+lcuX28n2adOPUZz3oSt9K2YbmmqG6KWMsHF1Z73wPzlpvynekXd3dGj+zjZy41k9u3R2sNcZFs1PtPR285wZ+lQEgAgLs+kZrt1nzhSLvLEWP1Ks9NFn7aXt3eD7WcjOnaLrh80Ze/e5YzKbU3ZufCpcZY/HkjxQB+dM2cz93+rDzbWv95r04PfEwK9PE1/lCsTtjQvVvynekXXPXNX5mDU4cf6T1w1J3B/lb6J+w+X6DT+J3DgAiIMBe4XEJ+aSULxTp/tTIpBVNN4nWLXrtNiPyqV40i6WWyhV6qVrvSHOtUYLRbFvbJ+DVzCbDt019qD/5+ETTr/9X8Rxt81bhRdemBdG/KR1PxlhbVuNo/MyqZu0p6o58Owep1N1Bmp+F
tlAd7vku8J1qr04MNAZABAQ4ABHwrBBPSjz3NNIRsHNrku52yCQfuZKWHqtX3+LjVRuf3YbGQcs5hRYCMd5HigKbOcU/Itat5GmKplW3jU928zAr89UgWFdKkmx7Psh8oXjG6Wg6LXmH3LSntWZe7NCZ5SeOzh0PwflCkS/m1uA5Mj6hdXfwgkugd0xLG+rOeTTlOOvAQCvKlFg1BKARGA4CUD3PlcqV7XVBer/kvMV8ouFnvpuvLl8oqqcp4TPLXN9eikNTKeFLNajv5Xzdjr/NLxlUVmg08eDo1O3lVCMtuXxhMXrfuWiCvlbPgM2Dl7owSevb6l9Qs23/kh6znWtI7BbtTiAU+aT27tR9U694jo5GJivPRVdavgYaPbPqE8cX8GCMJbafGRz31U3AjZzQujvINymq2gx+Qhv5tNC5nxQAQBUQQIt3ByyVK410BOwoiqFnnH9p/FtmgmM0O0kgFFksT9ht1vg9ieJdMOBTLSbxLLqc8g+Ldpu1VH4Z3s5h6nzgHxHpeykTBMfffftmTpGfK3ydXx7RGoxcHpcwOb+09WvRYjbFkz/St6u3jakaphlj/zv6d//IRVouxWI26cPrV5euXR/3UeEwupyigxYM+Jo75rxp1eDsN/Kmdps1GPCFbyUYY+FbiYdZmQp1m7mfg+O+JuJpg2eWTtxcNEFR22I2ec6e3swptCWNjJZt8ITW3UG7zUqzNMs5xXd1NjjuK5VfBkIR9m5147GO/uBgRhgARECAXaPugGwPlACb6NVkt1ljkRAt+EG3W75TOxoQ1x6rl/rg0UGzy6m7N2m+QFp0S/0u7+s624up+IcbmokjFgl9dekaLwRSLlG39pLvpye+vHSNEgnlDJqEhR58vyPSRr5QpHHc6he8Pn65uWOeyT7bPhoXaz2nwTedCV4plSu0p+qj7R++2NyHk0bO7PaJC9MUP3To+Cuk7obrH4GGT2jdHVycntjMKXJO4TM2s+1h1B39cIXugACIgADN8A6do0KId+jc3txC3lWualOaf0S026zR5RRfJiE47tPUfqjax1vlvKLbP3xR36PLYjY9fXAnGkvF70l8QTCP67T6yFCHRbvNapyYT3327gk0Dc14KCLnFIvZFAz4ro9f1u8IzVwdjq5QsPCPiLTqK70In9bEK7qDz300IJe+yz8sakqAtQ4X3yT1JCl8MLjR0swNvCnPQN4hNz8XdPQo/fBW4zO6elWtI9nImeVH76cHd+aiibS0QfVs/8hF/ZtFq0cAACAASURBVHGuug0NntC6O8hT++3l+1TutZhNXtFddfXequ9VawOMr38AaMSht2/fMsbob/riz22vX79+9erVq1evrNb3P6vrqwt7oWsUwAHx5aVrck7xj4i81XWvoeWMGWOzN8b0lbz+Pea0OvMBvOT6+oTGkxLVR3+T7yMdwr4xODrFS+xff/31N998MzAwcOzYsePHjw8MDJw4cWJgYOD48ePHjh0bGBj46KOPPvzww6NHjx49evTIkSMfbDt06NC75Lf9BYaDAOxpfxXP0Y1tzy47wZtNx4b3w3oMfBiQf/iALi/Rvyc0k5Wpad7jEpD/AOpCBATY066PX6Ze7YOjU4c/H9qDQZD6jVED5b4JQDSO4YBGwP48oYc/H6K+jxaz6Xtdz0gAQAQE6DPUlUoz3nNPoTneOjq6s6u7s7avdudAnVCaN2d9dQFjgQEageEgAH1wY1ucnljck4WNUrkyExw743TsjxIgY8xzVvAPiwe2x3P/ntA3v6zhdwUAIiAAdC+e7rMG0wPb/rtfTygA1IKGYAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAA
AAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAIC96AgOAcBekMnKHpeAfem+eFLa+rWoeXBsWLTbrLgs94G56Ir+wZngFRwZAERAgN6bnF+KxlKL0xPBgK/f9yWelAKhiH9EjEVC/bHB96RMVtY86Dl7GhFwfwjfSiACAiACAuxFgVAknpQYY/JzZR/sDu0F7VG/pEDGmMclXFBVLk/ZTu7ZTc0Ximnp8e9/vPzk4xMXXILgdPTgLOeU9NoGY+zUZ1aPS+hJXE5LG3SxCV84PK7TFrOp6tNmb4ypDt0LujIBABEQoMfmoit0T+qjspmxxemJUrkST0r9lQIvuITGK0NU6WSMvfllrcvbSQVj9SP+EXFxeqJWAGq7Urniu3pTUzftcgFbzim+q7P5wvvme4vZlLp7s2r3A/VpzWRlREAADsNBAHomk5WplWrf5D8Si4T8IyJFpbS0sf9SO+W/7uP5z26zekU31f94Hu2OwdEpyn+C0+EV3VT/0wfTzskXioOjU5T/PC7BK7otZlOpXBkcnZJzCn6rACACAvQBunMLTsd+yn88BVJACYQipXJl30T2wdGpqn3LuhN9KGb5R0Tl0Q+pu+GnD+4sTk8wxtLShr47YyfEkxLFrFgk9PTBndTdsPLoB6/opmTcnRM9F02UyhWL2fT0wZ311YXU3bDyaIWS6N/ml/BbBQAREGCviyclqmQs77v8R2i/SuXK7eX7/b4vVHmiApjFbKLQ02Vz0QRjzGI2UewjwYCP0k90OdW1bfC4BKryksXp7+hEd6GNNV8o0rsEAz7eCdJiNs0ExyijoxAI0Dj0BQRoQ5jb+rX4yccnqnaHok7r+v+lu6l/RKzbnV/OKQ+zsvxcUXd+YozZbVZ9+TCTleP3pM2cslUoCk7HBZdwffxy1Y5i+ULx9nJqM6dsFV7kC0XqR3XG6fAPV9kk/uRMVhacDnqa8cwvgtPhHxHjSSkaS/X7AEy7zUrZIhjwzQSv3F6+3/0GbnpH/4ioOZvXx32T80tpaYNqY53bADn37goMjvs0B8crutPSRvye1OkegWnp8fZeX1Y/7h8RJ+eXSuVK/J606JzALyUARECAbpCfK7yFTn8PDkcTck7xim713bHW3bRa3WWlVsvjlm4YpqZLViYrZ7JyNJZaX13QpDp9BzJqScxkZeELh/GT5Zwi55R4Upq9MWac7WaCY/GkVCpX0tJGTypnbZS6e1NwOro26kIfwamZ1XNWG7vPbJ8sOad0dDrGze0Cm8d1WrcNf0lLG12owL0bAlztRAhORyYrb6IKCIAICNA13iE3Ba+0tKFuIKM7N90XvUPndhQz1jYYY3abtcEZPSxmk39EFL5w8Nk3eCbgorEUbYbHJQTHfRbzic2cQj20Bkenfnpwh39vqVyZnF+iu+ZscMxiPsFfM/NEPlMj//GXLZVfRpdTNJZF+MJhkO1oB+WcknkiNx0Bq07tq8uaHa8y9nay663Ci+0r4QSdQTmnUBLiG7bZ4QjIJ9Cm+JUvFLcKL+gdPWdP8xja0Ulq6IMTz3/0w0XveMEloCEYABEQoKs8LoHGJKbXHmsiIG+30gSgh1l5V6lCcDrUPcAYY56dTyiVKxSVvKI7dTfMN+yCSxgcnSqVK3PRBG81lnMKxcfU3bB6RjcPY5rt52FR/bL0zy8vXZNzSjiaMM52fxXPyTmlldpMI8Mv9v1Mv6XyS3USpWlZLGaT8miF56Hf/3jZ4W2o8A0olStfXbpWKlc8LmF9dUHznE5vA83gSKNzGGPqavS+
GXsEgAgI0B+8opsmQNH0x4rfk+h/Ne1WdKOyt2/+YWpvZdt989XZMRjwhW8l4klJP3vcVuGF8aS+/GX1nQ5ng2O+q7OUJg2aR099ZmXbrczNUU/te2Bppg2n40m1wK6VJ9U5nn+K6M5IZPX7vj8IT57xD1RY7QMAERCgN4LjPhqoqG4Lpj5zjDH/8EWDO1lbZJ68m6pNH+nUjXTvmu1cgt1mpVGu/hExOO6r1XhHL0tlTs1/8eZj4xTS+roRuLtXPar5QtFiNvVkaRCmWj2lh+3j9OmCbRcFAQAREKAHKHvlC0V1WzCVAGm8ZKc3gBoKq1bj+E068+QZ/zp1N0zrK9AyHoLT4R8W9cNZ6GUzWfnw50M4y728wL5wbJ+RisVs+r/VfyTuSd4hdzeHp5xxOjJZebuAbV1fXcg8eaYZmduFHzRegKQftFK5sg9W1gZABAToY/6Ri+FbCXVbMHUE9I9c1D+Z+g72NrMqj36IJ6XocoqqlZO5pbnoykzwiv6GardZDdbM7dUg2QNFU3O126z64iivinVsG0xMVcD2uAR9CbDTayvTNvAmaU3XVdaOqjMAIiAA7M7YsEgDF6gtOC1t0OjFsWGxagLjBZVO47MJ6iOCf0T0j4g05x91+5ucX9r6tagZeiI4HeqxILtCHbZauTE3UoDs/lq9XcajFR8X8v4Ib/fG63T64dePvvcn75bX6W2w26yZGmM+aIhVpzMowH6C1UEA2nZzopoErdNAf/NFVDVo4pWH7etKTy9YtYshjwhnavQbs9usi9MTyqMVarCOxlL8Fkvfksk+a3rDNnM/s+1pO6CVq4supPTaY90Rfl+W6+g28NfXT4tNZ7kL/QJpWkQ+raYaXfzoFwiACAjQA/5hkW5F8aREwavWzM8Gd7LmeIfcjLFSuaKeGppQGK07B6HFbOJby6Mk7RGfcaYJFB/1Exo3bn11oe6fPXtJyDnFcf7bw58P0fQlLZ1i8RzFL00N7Pb2h41a3xhPSp8Klw9/PtT0SeQxlC4h6uTK5QtFCoV0EVY1Ob90+POhT4XLLa6qwnczsXMb+NB1g20AAERAgA6WSagQwudSrlUX8bhOU1Oa5k7W+lurFwgplSuBUITynHqyGDmnTM4v6UuG/NbOw6LgfDfzc/hWYnJ+SRNY697O39+YxXMt7prxnz17SYSjCTpomazc4hK618d9/JzyFBgIReouM0Mrp9FJbPEjB71LJivzNFkqV3xXZ9n27OW1cjBdk6VyZXL+n61sAH+X8K0EL2/T9UyXCurNAI1DX0CAdvIPi/zOZHBXtphNNJVgNJaqtYbvbqXu3hwcnaLb4Vx05dT2sraMMf+IqK4SbeYUWkpEPasIH2gZDPjU2xOLhAYLU/LOb6FlhRlj66sLBgmMr4O8DzrpV12pj/dTrHUc1F33+OoazaEloQOhSFracGSvqM9CMOAzOAvqqmHdmSDrXN4jYuaJHE9K4VuJePLHU7aT/LKJRUK1LmP1BrRe9l6cntjMKXJOGRydovVR6CfOYjbpZ68EAAOoAgK0NQJux526c8HMBMdoXLBmrd5WCiTrqwt8pgzKf9TPT3Nr5C16pXKF1hGmsSl2m3X2xphmLIjFbHr64M7sjTHaL/oWupHbbVaD8BqNpRpcB3l/fyTgX3/y8YnWry5KWuqzoD9l+u9Sn80WtyEWCdGY8XyhSJeNxWxK3Q0bXO2C8/2q061vAF3nFHnlnEL5T3A61lcXMBwYYFcOvX37ljFGf9MXf257/fr1q1evXr16ZbW+/7ky/tAPAI3jhaVYJFSrEa0JPP8ZTx3Mn8bvrHUb0XjJhzF2ynbS4I4r55QvL11jjAUDPuOA0nODo1OZrKxeZKy90tIGNZUqj35oS0ahE7eZU844HVQGq/st1D3AbrMqj35oy07RAsG0DY3cEUrlCi1qRym2LdtAXWlL5coZVcQ0xteU2/dDyGGfoV9T9PXXX3/9zTffDAwMHDt27Pjx4wMDAydO
nBgYGDh+/PixY8cGBgY++uijDz/88OjRo0ePHj1y5MgH2w4dOvQu+W1/gYZggJ6ZCV75l/RYzilUCGxXCrSYTY3clRt8mlqDN1pqpKPnH/C1PTJZORxNUBRuV42KTlzj545m/2a6xQNbQSOUG9yGUrlye/k+LWo8E2zbcn9Cw8kPAKpCBATopfXVBerA194U2EOU/3j74IGdODqTlflYDY9L6EkUjsZSc9EV3sWzC6vU6AVCEQqgFrNpcXoCbbUAiIAAwNh2xyZKgfF70j6IgNHlFOW//uqbFb6VUI/2aL3HyynbyXyhaLdZr4/7erWCGfUaFJyO76cnetuBxyu6F6e/68n1gLUNARABAfZ0CpycX9rjfeYatDg9USpXDMaH7jVVZ8xufeOp411vQ7BXdP8m3u/tiZgJjvV2oC46rwMgAgLs6RS4b+azoPbf/sqsHXrlnhdB90IK7/lB2MszhwP0FiaFAQAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAaBOsEQzQe/lCMRCKxCIhzYKqaWlDfq6UypXOrWNbVzSW2vq1KHzh8I+IjTy/VK74rt78fnpCcDr64uAPjk5lsrLmwfXVBY9LwJW5Dxz+fEj/4Jtf1nBkAFAFBOgxOad8delaJisHQhFt/FpOhW8lNnNKDzdvcn6JUmCDzw+EIpmsPDg6Jfd0swEAABEQYE/nv8HRqVK5YjGbvu9dqa+NZoNjFrOpVK70VwqcvTH25pc1/qdqCbBUrsxFVxznvz38+dDhz4cGR6fS0kYPr5y56Eo8KXX/rTNZ2Xd1lg6C4/y3gVCkVK50eRviSenLS9doG768dG0uulLrmerTur66gN85ABwaggF6plSujIcilP/WVxf6peXUmOB0rK8uUK4dHJ1SHq1YzKb9cbI0oTaTlTNZefbG2EzwSq8+OXhcQoMN9G3MXupydb5QzCelTFb+v9V/aLoxdE4gFFFnXzmnyDnlYVZGwgPYFVQBAXpmLrpCkWLf5D91CqTYpG/d7lOT80t0smZvjD19cGd9dcEruhlj4VsJfVfCjorGUl9eutb9whvb7rRKpzh1N/z0wZ1YJGQxm/jj3cmglP+8ont9deHpgzuzN8YokRvUAgFAD1VAgN7IF4rRWIoixX7KfzwFzt4YC99KpKWNTFbu96EV+UKRYoe65udxCY7z3+YLxXB0Zb0rOxhPSnPRRL5QZIxRa3vXP7Qk6K3XVxeouEuXLnUA7c6Jpm3wuITU3TC/2PKFF/GkFI2lro9f3h9VZ4AuQBUQoDfoTma3WRtsRsxk5WgsFU9KbexgVypXMlk5npTmoit0CzdIFXJOoecYP42bCV6hlsFw/9dmbi+nKPpcH7+8cx/f1Z8olnU6/wVCkXyhaLdZU3fD3f/YUCpXKAcHAz51zPKPiHSi4/c63jExLW3QoZ7d+VNDJ6JUrvSwdyZA30EVEKBVvquzaWnDYjb9Jt+v9b+C0/H0wR393ZRuXcbknPKpcFmdujwuQT/rCk1+oe+aVvXxUrkyOb9UdTDB4vREMODTPBi+lQjfSqgfCQZ8daeqmQmOUX2Igkv/nmIalO1xndZUmLyim4UomjzWH7T28opuYdnhHxb9I6LFbIoup7p8EPhnD++QW7dt56KxVFraYB1uDZafK5TFNeVGu80qOB1yTkmvPe5y50iA/oUqIEDL9+ahcxSqqtbnMtlnjLELO+9YvFZB/cmMlcqVUrlit1k9LoFiX+uzrviu3qT8JzgdHpdAfyilnalRXqL7Lr/1RmOpyfkl43ehsEIJqa9PMfX2O+P8i/6Y0EFrfNKcplnMpqcP7mgqcF09CE+e0Rf6AqTwhYNfqB3dhodZueoG8Ou2C+VYgH0DVUCAVvFUFL8nLTonNFGPbor+YXHn3VSmb2zkdm4xm1J3b/J3yWRl39WbpXLlb/NLzQ2BpCZd
Vq3gly8ULeYT+m9R1/xokEda2ojGUtfHfcblPa/ojielzBO56SJZ1al9Nboz0++pz6x09P42v2S3WRenJyxm0ynbyXyhuHlgJkHkp3tyfmkzpwTHfV7RzR+Uc0oXugPS2/HBRrPBMcHpsNtOMlWpEgDqQhUQoA03JCpL6Gtd6bXH6ieokxbTlQZroUKdOnFSk24mKzd3w+OlGn2Tmd1mrZpK1Q9azKbF6e+2M+7jegen72/MfOO3uzYmqANlT+bk66HN3M+MsVPbJzQaS1Wdz7wL54IuqnhSSksbaWkjHE3gtxAAIiBAbwTHfRTs1FmHd073j1yslSqaw6Nbeq2Zzu90F2eM0fCC5lIv5aHf/3hp/EzP2dOsteY59dS+tf509ORqGjdL5ZcH8yJX73hPpqTRvG/daw8AEAEBOo536VMPiuStwGPDonGq2C2L2URlRSrMNBHgqFk2LW04zn9LA1aay5HNbUB/0ZRFg+M+Ov4HbdiBuocA7xXa5dWr1edibFikovVsA2OqAAAREKBTdyZKgeqGUWoFVveUavu9sOmK1OL0RCwSog1LSxu+q7OfCpfnoiu7zaYHoSTGG/Gpw59XdP8m33/64M5Bm3+ORsPwAvb66sKbX9a6nIPVn3zsNqvy6Iff5Pv7b1pNgO7AcBCA9vAPX6RJy+ScIjgd+ULxXSvw8MWqd7Ked4/zj4j+ETEtbaTXHseTUqlcCd9KxJM/tnfOudZHaDay5EN3lmir2vK4VXjBGu7WuQ9U/ZDAz3J35gCv+sEjX3jRtQ0AQAQEgPeo2pcvFKPLqVgklLgnMcbsNmvVaV/sNqucU1ppRaUhva0nD6/o9oruxemJ28v3o7FUvlD0XZ1VHv3QrsNCE7m1cmPWzEfYkwjocQmZrKw/X6VyhdIPDRbe36hbJ2OMPuToz3IXpn684BJqjYLazHVpGwAQAQFA6/q4b3J+KS1tlKYrtPibfiDI9t1USEsbNGVgE3jXPZqPjcfKfKFItZDdsphNM8Erpz6z0gAR/T2+aQ/fzajX/KvthbqOd8idycrUuVPd/stPhMFGxpPS1q9Fz9nTvdoRmof89z9eUue5Vk4ErUpXbfKjx8YHIV8oJu5Jn3x8gk8V2fSJCN9K0EAr9YcrPhLLcxZVQABEQICu84+I1J1ucHSKsoJmPTF1MYNtDxluZHZozd10cv6fTFdi9LiEfFJKSxv5YJ2lOKgtT38nbnvntrbcmJub+7DNEVA8R/Ngz0VX1PMj8vVqax3wuegKr2Iqj37oSY3Kd/Um1YzjyR9brO/SLI/xpKSeD5KKx0w3+aX6evvq0jW66uTnSiwSanoDBKeDPuqEown1xU8/EbxLLgAgAgJ0lcVs8o+I0ViKco/BQg6C00HdAaPLqbo3ra3Ci7noiufs6VL5ZeaJTP32GGOaW6l/WKT/+urStWDA5zl7Ol8oVl21Qs4pg6NTXtHtOSucUY11oF53+lkMm8bXQe73GzONoY7GUtFYqlSu+IfFUvllOJqg6PN97VGxVAQliXtSiw3WVFDkVwW/NuiRWi+e2d6GfKGYycqtFCNngmNUCv3f0b/PBMfsNmvmyTPKuF7RXeuV5ZzCexDGk1IrEZAxtjj9ne/qLF3DwXGfxXwiupyicuxM8MpBG6MDgAgIsFdcH/dREzCrNheMWnDcR+vn1i0E5gtFTX84i9kUi4Q0d1yPS4hFQoFQhAZ2GLwgrfRF0+rqU2zqbrgthyJfKNLkydfHffvgzC5OT1CLqmZS6FgkZJCYqy610nwEvCdlVJlSc23UioDUetuuKLy+ujA4OpUvFNWTQgtOh0Gwa28s84puus4zWVl9NPwjYqeXaQZABAQAo3tkLBLa+rV46jOrcauff0SkO3ogFFFcK1Vvk99PTzzMyvJzRT2u1jvkrtWhyj8ielzC7eWUer2yCy7h1GdWdV6cCV4RvnBknsjqp1nMJ7xD57yiW/PKszfGmGoowPv3GhbplWvtoO/qLFPN
QbgPxCIhz1khfk/aKrz4xGw643QEx33GFdNYJBRgEYra6o6bzaFjvtvvWl9dGA9FqDLden1XcDp+enBnLprgfUa9Q27jU0wBkb6lLQVm/4hot1mjy6l8ofh7uSI4Hf7hi2gCBkAEBOixxmdK+3564stL16jv4Prqgj7VUXvxbjNoI7P10kDgRl6wVm3JeDcD25mjxVa/PXhydzUTnsVsoqE/bWkNb24SPuo/J+eUFodiaD7n7OpbvKKbegUE21QS5nNTA0DTMDU0QM/w5jPq2NSrRbfaLhCKUFPp4vTEgb1PUzv44OgUjSPpSRSmlXwd579NSxsWs6nLK3mQTFaenF9ynL+SLxQ9LuGgLakCsJehCgjQS3RHpJqZ7+rNvTD6tV35r7/6Zj3c2ceuxflTGGNbhRfUW45qZj2Jwum1DeopSP1EezJUIhxdoR57wYCvOzN4a6inFm9uyiQAREAA6GwK9A/vhwIJDUyevTHWk/t90zRjCzxnT7cYAani5Tkr6LtXdo13yF0qV2gbeng9XHAJ3iF3r5Zxa2RqcQBEQADoWQo0mFuuv3hcwtMHd/po2dbvpyf0TfBt2f6e94MUnA7NHM69+oTTQ/ugsg6ACAiwn+2nha36KP/13dZCE59JcBAAqsJwEAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAA6BQsEAewO5ms3OUlp+Sc0qtFzPKF4lbhhebBurvf/UPUinhS2vq1qHlwbFjcT0v2HWRz0RX9gzPBKzgyAIiAAI0qlSu+qzczWfnpgzvdyWS+q7NpacPjEnq11H3inhS+ldA8+OaXNeNEFQhF/CNiLBLqjwh4T8pkZW3MPXsaEXB/0F/AiIAAiIAAu8t/g6NTck5hjG12qyxXKr/suwMlP1coCDLG+iUFMsY8LuGCqnJ5ynay1mWQljaoaug5e7onxc58oZiWHv/+x8tPPj5xwSX0qkLcc2lpgy424QuHx3XaYjZVfdrsjTHVoXtBVyYAIAICNMp39Sblv1gk5B8RD8hezwSv8HrJXHSlakFFY3F6olSuxJNSf6XACy6hbmUonpQm55dK5Qp/RHA6UnfD3awXTs4vRWMp9SP+EXFxeqJWAOrCh6LZG2NdLqrJOcV3dTZfeN98bzGbUndvVk3k6m3LZGVEQAAOw0EA6puLrlBb4YHKf03jRymelNLSxv7YqbS0EQhFSuWKxWzyim5KG3JO+d/Rv6tDYXfyn91m9Ypuqv9Ry3uXj4acU3hRvMvyheLg6BTlP49L8Ipui9mkLtIDACIgQNtuOVT9CgZ8yH+Np0AKKBSb9sEeTc7/kzEmOB3Ko5XU3fD66sLTB3fo8ri9fL871yHlP/+IqDz6IXU3/PTBncXpCYqn+u6MHVIqV+aiKz3MW3PRBAXxpw/urK8upO6GlUcrVIj92/wSfvQAEAEB2nnLYYzZbVZ0Id+V5UiIEkN3ElJHxZMSlZ2+VzW5Ck4HfSTQtMx29Dq0mE0U+0gw4KP0E13uxjZEYynH+SvhW4lSudKTj0P5QpFacoMBH+8EaTGbZoJjjLFMVkYhEKBx6AsIByrMrTDGhC8cXtFd9fb2+x8vNf/LbzkzwbG63a0yWTl+T8oXinQfolvUBZdwffyy/ntpYEF67XGp/DKTle026ynbSbvN6h06p9m8rcKLQCjCez7ZbVb/sFhrIAJtw2ZO2SoUBaej1rvTBsSTUnptQ84pp2xWu80aHPe1cXwDJaR4UorGUv2entNrj2mPNMcnOO6LJyU6lVUvqnZug7TBGPOPiJqzeX3cNzm/lJY2qDbW6UNRKlfsNmssEvK4hO73q0tLj7f3+rL6cf+ISN004/ekRecEftcBIAIC7PAv6THNsae/W5fKlcn5JcaYusTCbzkWs6luzSMQimjuiNQ2l8nKY8Pa23YmK6tTHWXNfKGYYYy6mmkqH3nVK2cYiyel
qr0SNWMFMlk5k5WjsdT66oJm3KimQ30pV5FzSlraaO9kLjPBsa4lpI6iTH9Bl4/5UZWfKx3dwXyhSO3pnrPabTjDtyGndHqEsn9EPKPLwV09Ec/ffbjSh13B6chk5U1UAQEQAQH0/iqek3OKnFPyhaJmFCcfteAVz+14fG2DMVb37s5HGnpFd3Dcxx/fzCmZJ7L+vXxX6Z748QAAGvFJREFUZ/k91XNWoCeUyi8zT2T/sDbYWcymYMDnOXuaXvD2cipfKAZCEY9LUL9yNJai/OdxCcFxn8V8YjOnzEVXqKf8Tw/u8CeXyhXKf/yVS+WX8Xs/pqWNeFLSNDW2wm6zCk6HnFMyT+SmE1LVqX11WbOzVUbKyjx2ZLIylWzpaGey8mbu545uAJ+g22I+QWeQPsxYzCYeyDY7HwHVb9cTmhOhKbejIRgAERCgOu+QmwZ2pKXHwYBvZ9R719KniWtUydOXXrQR8MkzSjypu2H14x6XoHmjUrlC4zctZpO+OFc1JwlOB484HpfgFc85zn/LGLu9nOJZjTrp0yvwbaC57gZHp0rlylw0wct7t5fv061UvQFe0U2FzGgsdX3c166JTih2t1KbaWQmmo5GQD6chVI4nxxnfXWB56FOz+DIX5/ekaYot5hNyqMVnod+/+Plvv8RpnNB5dhMVh4cnWKMqWel2R9jjwAQAQHajBJevlCM35PUyYxaKhljmgocv500mIdK5Zd1+2NRny3GWOruzeYm9bXbrNuVp/e5itpbGWOL099pdjkY8IVvJeJJic8exweWajZgcXqCNk8dLlt06jMrT9LNUU/t2xOawtLD7X3JPHnWtZIYNYBqPplQLbCP1uJr77mgz110RjBUCwAREKAO6juvaQvmffg0vesab1QaGxZpmKTj/JVgwGewwiyVGynGNb0X1OalLnhknsisWhWTbdeu2HZfMTmn0Dd6h85pnmkxm9reoar1auJeu7vbbdbMzgPbk22gdvwDuzQI/3TBqvXRBIBGYFIYOFh4Vz8+tJAxFr8nsWpjLXd1S45FQjRFbfhWwnH+W9/V2arjJalFr9b6Y7uiTqj0slW3n2dNqprw4Ei9yvThkrVWtNt/eMyig7w4PbE4PZG6G+5m+U34gm9DhTH2f6v/mL0xtr660P1FQfbCuaCDQEOXFqcnUAIEQAQEaCir0Z2bYh9jjM/hUnUcRuOv7B8RlUcrszfGqO5Fi0k4zn+LONXv3g8+eK6w7dE5ml6bnV4jjud1ulZplkpNCZBXxfb9ueBVav+IqOlr283F+gD6HRqC4cDxD4s0cpDagm8vp1iNlllN+aeR+xMtqkuT89F8woOjU6m74R5OicJnfmkkIuQLL9p7H+UDZZp+hcOfD9V9zptf1jr9yYFPy6LdwazMGLO3o6xrgJeN9Zci/4xxENIPtcJXPRHUR/NUh08EwH6CKiAcvAg48q6j3u3lFE2PzBi7Pu6r+mRKgdTTrnEelxCLhJRHP9AbhaPvx7TSLG6ZrKyeFLB19LJVOy/yiEDP4bl2s/aT29jDjGZL6fcua/TxQN15gPAD3ul+gXab9V11ee2x7ggr6o3c32hsPn1+q3ou0C8QABEQwDgFXmSMxZMSDaQ1mPn5Qo17f4O3bXojdTLjzc205my7eIfcjLFSuaJfrCy6XeakHMandqMErH5mWtqgO6t+pAhj7JOP37VF7mrejUz2GWtgVh0D66sLdf90+oKhA5IvFDXN+nRsDWbLk3OK4/y3hz8foulLWtoG8RxTjSjnqIxtUGaOJ6VPhcuHPx9qZIbFDpmcXzr8+dCnwmU+AWezB+HdbibuSZp93B7k5MbvNwBEQICaaM00viJIMOCr1e2P7ij6e7/+LhuNpTSViVK58i/p3XSD/EG+Nkla2hgcndK8LE3z0Ux1xCVQClEvEEJzENILqieLmQ1eoZ0aHJ3i20wLljDV0rdV0zBjTL3mb93D8u7GLJ5r+mTRrhn/6XgEFN1UhFOv6UIf
Iej6qfWN4WiCns8nD2/hovXxc8pTIN+e4LjPIH7R88O3Eu2tPTdIzil0TZbKlRY/+fBPa+FbCX7tyTmFfpA9LuEgD5EG2C30BYSDiBZh47dkzXqjmvxB/cDC0ZX12lEj80SOJ6XJ+SVa6lcTj2aDO2a2i0VCvvJNWr1tcLvh1WI2vZuG2iU0V9ZK3b05ODpFt8O56Mopm1VW9ZpXV4monZrSoeP8t4LTsbXd0c1iNi3XWCBOcDpoqY/wrUQ8+eMnZhO9vkE/vLlogqla3vta6m6YErPj/Lcel8DDunribj11172tX1uKXzTqPBCKpKUNR/aK4HRsFV68y38Bo8Wd1VXDrcKLFs+Fvmtm+FaC5squdemqN6D1DLo4PbGZU+ScMjg6pf7BsZhNbVzbEOAgQBUQDqiZ7VhWdy4Yqp/RYru1niN88W5CPqoX0h/KB/qxILQuSCwS4jdjOae0PnCYXpZqJDyg2G3WxekJ/a3RPyKm7oZpA97PFCi6f3pwx6COsry9zXwYtd1mrdUuzMuiBgWqPiI4HeurC7T7fCEyr+g2zuvqYea8Jb1pNA0KFbB5d9LZG2PG83ira7o9mUSGPjy0awPoOqfIy39w1GcHABp06O3bt4wx+pu++HPb69evX7169erVK6v1/c+Vek0kgAOCWmyrLummli8U+VqujDG+jKwBzbe05YeL579Gpg7m+Y8KKo28Pv8Wgx2Uc8qXl64xxoIBX7sWGunoyVUvMmaMshcd20YyB18Smg8Pasv53cwpZ5yOBs8adQ+w26zKox96cpBL5Qotakcpti2vSYNCSuXKGVXErHvuqFNmp4eQA3Ti1xR9/fXXX3/zzTcDAwPHjh07fvz4wMDAiRMnBgYGjh8/fuzYsYGBgY8++ujDDz88evTo0aNHjxw58sG2Q4cOvUt+21+gIRigvlgk9NWla6VyZXB0yiAF8mGbjWviWxopkzQeJZvoO9VIrKQbrXEjaZ/yuARPw0/OZGUaDx4MtG3ZZTq/jZ9i3mdRs3hgN/Pf7eX79CFqJti25f6EhpMfAFSFCAjQUFBbX10YHJ2qmwKB8h+Ns07dDR+05SvU4Y+P1fC4hJ5E4WgsNRddoZKtfjrr7giEIhRALWbT4vQE2moBEAEB+gx1NqJwk17bQASsGTuWU5T/+qtvFh/TQFrv8XLKdpLmHr8+7jMYMtxR1GtQcDq+n57obQcer+henP6uJ9dDI1OLAyACAkD9FBi/J2FNUgOL0xOlcoVGLfTFBp+pluZb33jqeNfbEOwV3b+J93t7ImaCY70dqIvO6wCIgADtSYGLzgkcBwPU/ttfmbVDr9zzIuheSOE9PwhdmDkcoE9hUhgAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQIB9I18oDo5O5QvFfbAvpXJlcHRKzin9ssGDo1OHPx/S/MlkZVyW+4P+5B7+fAiHBQAREKD35Jzy1aVrmawcCEV6kn4c578tlSvtes1AKJLJyv2VAgEAEAEBoNv5b3B0qlSuWMym76cnuvnWmaxMta58oXh7+X67XnY2OGYxm/quFjh7Y+zNL2v8j8cl6J9TKlfmoiuO899SJWlwdCotbXR5O/OFYiAU+VS4fPjzoU+Fy76rs90/yJms7Ls6SwfBcf7bQCjSxo8QuxWNpeaiKwYVdPVpXV9dwO8cAERAgN4rlSvjoQjlv/XVBcHp6Oa7W8wm/vWpz6ztelnB6VhfXeApsIfhoO0na3B0KnwrwdMGJaG56Eo3PzB8delaPCnRUS2VK2lpo8tJNJ6U1O+YLxTjSemrS9d60o0hEIpMzi+FbyW2Ci/w+wQAERCgb8xFV6iE0/38R1ktFgl5XEIw4POPiO19ZSq3lMqV7rdud8jk/BKdrNkbY08f3FlfXfCKbsZY+Faiax0HfVdn6QNDLBJ6+uBO6m5YcDroIHcn
alMNkk5x6m746YM7sUjIYjbxx7smXyh+eelaPCnh1whA047gEAD0RL5QjMZSFCm6n/+If0Rsb/hTp8DZG2PhW4m0tJHJylXbVfvrZFHamL0xNhO8Qg96XILj/Lf5QjEcXVnv/A7GkxJV2lJ3b9LxFJwOwemgrpy3l+/zDevkh5YEY4yK1lRFpkuXOoB250TnC8W5aIJOBxWb8csEoDmoAgL0Bt1N7TZrg3fuUrmSycrxpDQXXeHd+GrdI9PSRjSWymTldt0g5ZxCb0p/GnnZmeAVu83KGAt3sam0Q24vpyhwXB+/vHMfxxhjmazchWbQ6HKKcqc6ZtltVgrx8eSPnd6AUrlCwSsY8Kl7EfhHRDrR8XvdqMn97+jfaTP8I2IsEsJvEoCmoQoI0Crf1dm0tGExm36T79f6X8HpePrgjv5uShnCWL5QnJz/Z9X+XsqjH+juy59J9Rj+iMVsCgZ8mpSpnxRDXdzSGBydqho3gwHfYr3xKzPBMdqefKGo3s6+s5lTGGMe12l19GGMeUU3CzHGWFp6HAz4OroN1AztHXJrHvcOnaMCoZxTOlpO5uNOqmyDeC4aS6WlDdb51mD/yMV84YV/WPS4BMzdA9AKVAEBWuUdOkeprurYzEz2GWPsws4GMp7nqD+ZAeprz59PRSCPS7CYTXabVZP/aHIZxpjgdNBzSuVK+FailX5a/C5rt1nVJahoLDU5v1Tvbi1SZkpLj/v6FNNBOOP8i+ZxOguMsa1fi13YAMbYGV3I47Gv05XIzJNnmnd8vw1fOOhHoAvNsjPBK9SHFb95AFqEKiBAq/jdKH5PWnROaKIe3RT9w+LOu6lM36ipKmnIOYXSm8VsWpye0PTb09zyA9uDi3lfMXownpTiSclzVuDf/uaXNf5dDU6Tqy4T0vgDamu+Pu4zLu95RXc8KWWeyE0XyRrZQvUedQ6Nm5Zzyt/ml+w26+L0hMVsOmU7mS8UN7s1M8sp20n6YBC/J11wCby1nTEmP1fqfqJoHX+7yfmlzZwSHPd5Rff7bcgpCGcA/QJVQIA23BSpLqKvdaXXHqufoElvF+rdLMOq3vf6cRvq7MV7By5OT6jvwbxeQl0P28JiNi1Of7edcR/XOzgnmaoNsR/xjd/u2pigTpndHI66uXMbJueXMlk5fCvRzQO7mfuZZ1A5p1Bn030z4hvgAEIVEKANguO+QCii6Y9F07YxxvwjF2ulCgOqbxfr9vFKr21QONMnxeC4j3rjtbGvGLVB5wvF3/94afxMz9nTrLU2yu5U+IxPxM5/vuz+NmiOc0+Gwap3HONwAfYBVAEB2oA3wKkHRfJW4LFh0ThVVGXQ+16PqkRVEx5/sL2NlVQNengA+uNrGuuD4z6L2SQ4HR2aT6eqTz4+of7n7I2xBj8btPU4vN8G3it0sbtL2gAAIiDAnksJlALVDaPUCqzuKdUT/N07PWRhv9JkaK/o/k2+//TBHeN+nO3FR4HQB4OZ4JU3v6x1eUoUGg3DP5msry68+WWtmzkYANoLDcEA7eEfvpiWNnh7K03OR49XTRV93T2uca0PU21kBbYuzIrMdK2x74J14QVroFtnu+jrx/wIU5t79zdgxzZgLAgAIiDAQUPVvnyhGF1OxSKhxD2JMWa3WasO0rTbrHJOof71tfAiU75Q9NR7d2qkq3p75vOJdCciaMjPlRaTQfhW/YEsnY6ANAWd/nyVyhVKP21cZLnWBtAXm7oht/yzBDXNd3Abtq8ffadSOst9PfUjwAGEhmCAtrk+7mPbXQBp8Tf9QJDtu6nAtqcMrEVwOuie2shgXnpBOafoq240UoTV6CnYaQ/fzajX/FvzqRAN/nQ83w+5mapz5/tjq5qvsdb38gVdWv+MwaqtwMFHnddKYHQ1zkVXWqzI8jmMqmyD9Nj4IOQLxbnoSjSWwjgSgL0DVUCAtvGPiHPRlVK5Mjg6RVP0adYT46jdkMb8Gszl5h+5GL6VoDU/aBa6um/tuzrL129l
25N3MNUszfqgKeeUh1m5E4U0ahbnCbU566sLPT+zXvEczYM9F13hAyBK5Qqlc49LqBW/5qIrvIqpWcpl9zH0XFrakHOK+pqRc8r2amkXa32j7+pNCqDx5I/Kox9ajKE0G456PshoLEXh0j8s1sqgX126RuFPfq5gVTcARECA/YbmZInGUpR7NEupaoIXZa/ocsogAs4Er/xLeky3+bS04R8RT31mPeN05AvFrV+Lm7mfZ4NjVNujuaMDoYicU766dO36uM9uO5l5IlM+oP+tFUZp/d9AKOIdOpcvvJCfK+p5pFvB10HuwpTFHWW3WYMBXzSWojqWf1gslV+GowmKPt/XHhWrHjGduCe1krP9I2J0OUWzhQef+zxnT+cLRQqmdpu11ocNpuoJkC8UM1m5laLpTHCMSqH/O/r3meCY3WbNPHlGGdcrumu9spxTePEvnpRajID5QjGxXYbMF168e9l7Ei1e4jl7Gv0RARABAXrg+riPqm6s2lwwajSVYCYrGxcC11cXeAsaf+X3sWD4Im/epdBG0xOql26z26ypu+FaYXQmeCWelGjNYj7XcVvGuuYLRXpBah/vd4vTE/woqSeFjkVCBi3s6olUWre+ujA4OiXnFHX/SIvZZHB+6Qntan6126y0DVSZVn+kMQh27R06vVV4oe8eys/I7I0xREAARECAHrDbrLFIaOvX4qnPrMatfv4RMX5PovKb4lqpdZukAt71cV/inqQuKdltVuELhyZ8+EdEj0u4vZyi6Uss5hPeoXNe0W2cD5RHK3PRFT5r4AWXoA6vNAWdfiiJf1i84BIMhkH4rs6y7frZ/ji5sUjIc1aI35O2Ci8+MZvOOB3BcZ9xD8tYJBRgEeoySAvptoLWibm9fP9hVqYxGWecjpngFeOMtb66MB6KyLVnjtwVwen46cGduWiCD373DrmNTzEFRPqW1jfglO0kXZNV9WTME0CfOvT27VvGGP1NX/y57fXr169evXr16pXValX/NsFnLIC2kHPKl5eu0T1S3YFvH6C1ifviN8bg6FQmK6sXQW6vaCw1Ob9kt1lb7IfXCt/VWepI0Kt+eNQdMF8oxiKhHk4lmMnKg6NTbA8sOQPQxK8p+vrrr7/+5ptvBgYGjh07dvz48YGBgRMnTgwMDBw/fvzYsWMDAwMfffTRhx9+ePTo0aNHjx45cuSDbYcOHXqX/La/wIhggJ7hzWdyTqERJPss/2kWLD5QqB18cHSKGuV7kr1oMJDj/LdpacOgP2inU9fk/JLj/JV8oehxCZhKGmDvQEMwQC/xDnxyTvFdvbkXRr+2K//5R8Q+agLWrHQ3Niy2OMvdVuEF9ZajvgE9icLptQ3qNudxCbFIqCdl5vD2hDjBgK87M3hrqKcW58NHAAAREGAPpUD/8H4okPiHxXhS6ly7aodksrJ66j7P2dMtRkCqeHnOCsZ9MTvKO+QulSu0DT28Hi64BO+QuyfTUrLGphYHQAQEgJ6lQIO55fqLxyU8fXCnV/f7Jnw/PaFvgm/L9vd8AjzB6Vh0TvT82u7tBuyDyjoAIiDAfrafFtfqo/zXd1sLTXwmwUEAqArDQQAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAAAAERAAAAAAEAEBAAAAABEQAAAAABABAQAAAAAREAAAAAAQAQEAAOD/t3fvum0cUQCGl7uzYnq/g2FAgJ7XrftED+AuhYu0eYMAKfIGunG5KQaaHM1eeAkjS873FQRNEhRhNj/OzCxBAgIAIAEBAJCAAABIQAAACQgAgAQEAEACAgAgAQEAkIAAAEhAAAAkIAAAEhAAAAkIAIAEBABAAgIAIAEBAJCAAABIQAAAJCAAwI9j
s9lIQACAH7PzcurF2/O0ly1KAABeuQsvn4AAALyv4DvmNRIQAOD9VWBZFD5PezAerQUDALyd/itmO+3IimsPNiYAAG8tAZuX48D4+DEVl475S/GfX375+utvv/sCAABewR9//lXut21bUi/fj4+c9LbppP5rmubLz199GQAA30XXdW3bdl037b+TWtBxEACA96Ft29x/5TY7dRX4nwRc2kuY3+76+tp/OgDA9/Xx48eUUtd1pf/KonAMwaUEjI+n2afHcYwVeHt7+/nz52/fvu0XjBNN0+RbAADOEM98fPjw4ebm5tOnTyUByyww35kOAtd/QWRTQq102ziOOeyGYRiGYbfbPT09PT4+Pjy7u7t7eHi4v7+/v7/Pjzw+Pj49Pe12u91uNwxD1YXx/QEAWGm+Juzqy23XdV1KKaV09Wy73f700na73W63V1dXfd/PTgqbg1PAqjrLJyjJmVIahiGl1Pd9Tr38svyaKgFjXPpqAQDWE7A625FLLidg3/d935fOq1JvWnsrmwIXF4JL+e33+3LqOEdeTsCl/huGIc8RqykgAABHtmAcBJYxXOm/UoFl5je9Xsx6BabZ/mvmRoDDMJQpYJnzxf6Lq8D7/b55XlbO76wFAQDWy29agXHPX26+shwc13zjsm/1brMVuLgQPB0E5j9cnfmo+m+6EVD5AQCc2oK528pewFJisQKr5eDZcyFLUpV90/aMS8B5hTeHYNV/1Qhwei5YCwIArJdfvBMzLEvP+qBUYHzlNASrKDw8BYxrweM4ppRiI8bdgXEEWBaCfZ0AAOflYPwVuLgjMIZg7L/q0oBnHgdpnud8ZZ6XKzC/rDodMh0Bxv7TggAAR5ZfuV9dGqakV1kULuIUML7VUcdB4lpwtR2waZqu68pTbdvudrt4FZh8W84CNy/PggAAcGoOVrvyygCunA4pt3EvYDwUMi3LmQSsWjAOAmc/UP4Qsf+mvxFSEYUAALP1NZuAJcZiCK7/QMgxJ0LS+kcpIRiHisMwxCvFlMmfXwQBALhsFE5ngSXDYhHGZ5szLg0drws4jmO+Ikx+Ks4C8+N5d2Apv3gExCowAMBFErAJx3CrdeGqC2fPgsyG4GZ9uTbfqRZ5Y/MtLf76UTgAgH/Zf1UIVqu90/KbXQU+JwFjBTbPs70lK+WnBQEAToq/+EgVgk3YIFjSsAmrxuv9t5iAsxUYO6/KvtnNf7IPAOBSLRgjr6q9pWsBruwF3KyE2mzPTQd+sg8A4NWKsKq9pfMf6yeCN+vRVj27tNXvmFEiAACnBt/K4ytrvgd/I3hzTKUtheBsFwIA8N9F4cHLPh/sv2MTsDHnAwB4ezl4RvydloCaDwDgfRXhJRNQCwIAvNPyu0AC6kIAgLdfe6+RgAAAvH2t/wIAgP+bvwFVi/qqUyboLwAAAABJRU5ErkJggg==" />
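
The one-hot scheme described above can be sketched in a few lines of plain Python. The four-word vocabulary and the helper name `one_hot` are assumptions chosen for illustration; a real vocabulary would be built from the corpus.

```python
def one_hot(word, word_to_id):
    """Return a list of length |V| that is 0 everywhere except at the word's index."""
    vector = [0] * len(word_to_id)
    vector[word_to_id[word]] = 1
    return vector


# Hypothetical toy vocabulary, with each word's position used as its index.
vocab = ["the", "cat", "sat", "<eos>"]
word_to_id = {word: index for index, word in enumerate(vocab)}

print(one_hot("cat", word_to_id))  # [0, 1, 0, 0]
```

Note that the vector length grows with the vocabulary, so for a corpus with thousands of distinct words each one-hot vector is almost entirely zeros.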

It’s also possible to use more sophisticated embeddings. The basic idea is similar to that for the one-hot encoding. Each word is associated with a unique vector. However, the key difference is that it’s possible to learn this encoding vector directly from data to obtain a “word embedding” for the word in question that’s meaningful for the dataset at hand. We will show you how to learn word embeddings later in this chapter.
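
To make the relationship between the two encodings concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the numbers in the embedding matrix are made up for illustration; in a real model the matrix entries would be learned. The sketch shows that multiplying a one-hot vector by an embedding matrix gives the same result as looking up the matching row.

```python
# Toy vocabulary; these words and numbers are hypothetical.
vocab = ["the", "cat", "sat"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_id[word]] = 1.0
    return vec

# A hypothetical embedding matrix with one 2-dimensional row per word.
# In a real model these entries would be learned during training.
embedding = [
    [0.1, 0.3],   # "the"
    [0.7, -0.2],  # "cat"
    [0.5, 0.9],   # "sat"
]

def embed(word):
    """Look up the row of the embedding matrix for this word."""
    return embedding[word_to_id[word]]

def one_hot_times_matrix(word):
    """Multiply the one-hot row vector by the embedding matrix."""
    vec = one_hot(word)
    dim = len(embedding[0])
    return [sum(vec[i] * embedding[i][j] for i in range(len(vocab)))
            for j in range(dim)]

print(one_hot("cat"))
print(embed("cat"))
print(one_hot_times_matrix("cat"))
```

Because the one-hot vector has a single nonzero entry, the matrix product just selects one row, which is why embedding layers are usually implemented as a lookup rather than a multiplication.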

In order to process the Penn Treebank data, we need to find the vocabulary of words used in the corpus, then transform each word into its associated word vector. We will then show how to feed the processed data into a TensorFlow model.

> **_Penn Treebank Limitations_**<br>The Penn Treebank is a very useful dataset for language modeling, but it no longer poses a challenge for state-of-the-art language models; researchers have already overfit models on the peculiarities of this collection. State-of-the-art research would use larger datasets such as the billion-word-corpus language benchmark. However, for our exploratory purposes, the Penn Treebank easily suffices.

# Code for Preprocessing

The snippet of code in Example 7-1 reads in the raw files associated with the Penn Treebank corpus. The corpus is stored with one sentence per line. Some Python string handling is done to replace the "\n" newline markers with a fixed "<eos>" token and then split the file into a list of tokens.

Example 7-1. This function reads in the raw Penn Treebank file

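import sys
import tensorflow as tf

def _read_words(filename):
  # Read the whole file, turn each newline into an "<eos>" token,
  # and split on whitespace to get a flat list of tokens.
  with tf.gfile.GFile(filename, "r") as f:
    if sys.version_info[0] >= 3:
      return f.read().replace("\n", "<eos>").split()
    else:
      return f.read().decode("utf-8").replace("\n", "<eos>").split()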

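The string handling used for reading the corpus can be tried on an in-memory string without TensorFlow. The sketch below is an illustration, not the book's code: the helper names and the sample text are hypothetical. Replacing the newline with a bare "<eos>" yields a standalone token only when each line is surrounded by spaces, which we assume holds for the preprocessed corpus files. The second helper pads the marker with spaces so it also works on unpadded text.

```python
def read_words_from_text(text):
    """Mirror the newline-to-<eos> string handling on an in-memory string."""
    return text.replace("\n", "<eos>").split()

def read_words_padded(text):
    """Variant that pads the marker with spaces, so it also works when
    lines are not surrounded by whitespace."""
    return text.replace("\n", " <eos> ").split()

# Lines padded with spaces (assumed to match the corpus files).
padded = " the cat sat \n the dog ran \n"
# Lines without padding: the marker fuses with neighboring words.
unpadded = "the cat sat\nthe dog ran\n"

print(read_words_from_text(padded))
print(read_words_from_text(unpadded))
print(read_words_padded(unpadded))
```

If you adapt the reader to your own text files, the padded variant is the safer choice, since it does not depend on how the lines were written.
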
With _read_words defined, we can build the vocabulary associated with a given file using the function _build_vocab defined in Example 7-2. We read in the words in the file and count how often each unique word occurs using Python’s collections library. The words are then ranked by descending frequency (ties are broken alphabetically), and we construct a dictionary mapping each word to its unique integer identifier (its position in this ranking). Tying it all together, _file_to_word_ids transforms a file into a list of word identifiers (Example 7-3).

Example 7-2. This function builds a vocabulary consisting of all words in the specified file

import collections

def _build_vocab(filename):
  data = _read_words(filename)
  counter = collections.Counter(data)
  count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
  words, _ = list(zip(*count_pairs))
  word_to_id = dict(zip(words, range(len(words))))
  return word_to_id
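
To see what the resulting dictionary looks like, here is a small sketch that applies the same counting and sorting steps to a hand-written token list instead of a file. The helper name and the toy tokens are hypothetical.

```python
import collections

def build_vocab_from_tokens(tokens):
    """Map each word to an id, ordered by descending count, then alphabetically."""
    counter = collections.Counter(tokens)
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    words, _ = list(zip(*count_pairs))
    return dict(zip(words, range(len(words))))

tokens = ["the", "cat", "sat", "<eos>", "the", "dog", "sat", "<eos>", "the"]
word_to_id = build_vocab_from_tokens(tokens)
print(word_to_id)
```

The most frequent word receives id 0. Words with equal counts are ordered by plain string comparison, so "<eos>" sorts ahead of "sat" here.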

Example 7-3. This function transforms words in a file into id numbers

def _file_to_word_ids(filename, word_to_id):
  data = _read_words(filename)
  return [word_to_id[word] for word in data if word in word_to_id]
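
One detail of this conversion is easy to miss: words that are absent from the vocabulary are silently dropped rather than raising an error. The sketch below demonstrates this on a toy vocabulary; the helper name and the data are hypothetical.

```python
def tokens_to_word_ids(tokens, word_to_id):
    """Convert tokens to ids, skipping any token missing from the vocabulary."""
    return [word_to_id[word] for word in tokens if word in word_to_id]

word_to_id = {"the": 0, "<eos>": 1, "sat": 2, "cat": 3}
tokens = ["the", "zebra", "sat", "<eos>"]  # "zebra" is out of vocabulary
ids = tokens_to_word_ids(tokens, word_to_id)
print(ids)
```

As a consequence, the output list can be shorter than the input when the vocabulary was built from a different file than the one being converted.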

With these utilities in place, we can process the Penn Treebank corpus with the function ptb_raw_data (Example 7-4). Note that the training, validation, and test datasets are pre-specified, so we need only read each file into a list of word indices. The vocabulary is built from the training file alone.

Example 7-4. This function loads the Penn Treebank data from the specified location

import os

def ptb_raw_data(data_path=None):
  """Load PTB raw data from data directory "data_path".
  Reads PTB text files and converts strings to integer ids.
  Mini-batching is handled separately by ptb_producer.
  The PTB dataset comes from Tomas Mikolov's webpage:
  http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
  Args:
    data_path: string path to the directory where simple-examples.tgz
               has been extracted.
  Returns:
    tuple (train_data, valid_data, test_data, vocabulary)
    where each of the data objects is a list of word ids that can be
    passed to ptb_producer, and vocabulary is the vocabulary size.
  """
  train_path = os.path.join(data_path, "ptb.train.txt")
  valid_path = os.path.join(data_path, "ptb.valid.txt")
  test_path = os.path.join(data_path, "ptb.test.txt")
  word_to_id = _build_vocab(train_path)
  train_data = _file_to_word_ids(train_path, word_to_id)
  valid_data = _file_to_word_ids(valid_path, word_to_id)
  test_data = _file_to_word_ids(test_path, word_to_id)
  vocabulary = len(word_to_id)
  return train_data, valid_data, test_data, vocabulary

> **_tf.GFile and tf.Flags_**<br>TensorFlow is a large project that contains many bits and pieces. While most of the library is devoted to machine learning, there’s also a large proportion that’s dedicated to loading and massaging data. Some of these functions provide useful capabilities that aren’t found elsewhere. Other parts of the loading functionality are less useful, however.<br>tf.GFile and tf.Flags provide functionality that is more or less identical to standard Python file handling and argparse. The provenance of these tools is historical. Within Google, custom file handlers and flag handling are required by internal code standards. For the rest of us, though, it’s better style to use standard Python tools whenever possible, since doing so improves readability and stability.
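
As a rough illustration of the standard-library alternative, the sketch below uses argparse for flags and the built-in open for file reading. The flag names and the default path are hypothetical choices for this example, not part of any TensorFlow API.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line flags with the standard library."""
    parser = argparse.ArgumentParser(description="Load the Penn Treebank data.")
    # The default directory below is a placeholder; point it at your own copy.
    parser.add_argument("--data_path", type=str, default="simple-examples/data",
                        help="Directory containing the ptb.*.txt files.")
    parser.add_argument("--batch_size", type=int, default=20,
                        help="Number of sequences per minibatch.")
    return parser.parse_args(argv)

def read_text(filename):
    """Read a UTF-8 text file with the built-in open (Python 3)."""
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()

# Passing an explicit list makes the parser easy to exercise in a notebook.
args = parse_args(["--batch_size", "32"])
print(args.data_path, args.batch_size)
```

Passing argv explicitly, rather than relying on sys.argv, keeps the parser usable from tests and interactive sessions.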

# Loading Data into TensorFlow

In this section, we cover the code needed to load our processed indices into TensorFlow. To do so, we will introduce you to a new bit of TensorFlow machinery. Until now, we’ve used feed dictionaries to pass data into TensorFlow. While feed dictionaries are fine for small toy datasets, they are often a poor choice for larger datasets, since packing and unpacking Python dictionaries on every step introduces significant overhead. For more performant code, it’s better to use TensorFlow queues.

tf.Queue provides a way to load data asynchronously. This allows decoupling of the GPU compute thread from the CPU-bound data preprocessing thread. This decoupling is particularly useful for large datasets where we want to keep the GPU maximally active.
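
The decoupling idea can be illustrated with Python's own queue and threading modules. This is only an analogy in plain Python, not the TensorFlow queue API: one thread stands in for data preprocessing, while the main thread stands in for the training loop. All names and data are hypothetical.

```python
import queue
import threading

def producer(batches, q):
    """Stand-in for the preprocessing thread: push batches into the queue."""
    for batch in batches:
        q.put(batch)   # blocks while the queue is full
    q.put(None)        # sentinel telling the consumer to stop

def consume(q):
    """Stand-in for the training loop: pull batches until the sentinel."""
    results = []
    while True:
        batch = q.get()
        if batch is None:
            break
        results.append(sum(batch))  # placeholder for a training step
    return results

batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
q = queue.Queue(maxsize=2)  # a bounded queue limits memory use
worker = threading.Thread(target=producer, args=(batches, q))
worker.start()
results = consume(q)
worker.join()
print(results)
```

The bounded queue lets the producer run ahead of the consumer by a few batches without loading the whole dataset into memory at once.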

It’s possible to feed tf.Queue objects into TensorFlow placeholders to train models and achieve greater performance. We will demonstrate how to do so later in this chapter.

The function ptb_producer introduced in Example 7-5 transforms raw lists of indices into tf.Queues that can pass data into a TensorFlow computational graph. Let’s start by introducing some of the computational primitives we use. tf.train.range_input_producer is a convenience operation that produces a tf.Queue from an input tensor. The method tf.Queue.dequeue() pulls a tensor from the queue for training. tf.strided_slice extracts the part of this tensor that corresponds to the data for the current minibatch.

Example 7-5. This function transforms raw lists of indices into minibatch tensors drawn from a queue

def ptb_producer(raw_data, batch_size, num_steps, name=None):
  """Iterate on the raw PTB data.
  This chunks up raw_data into batches of examples and returns
  Tensors that are drawn from these batches.
  Args:
    raw_data: one of the raw data outputs from ptb_raw_data.
    batch_size: int, the batch size.
    num_steps: int, the number of unrolls.
    name: the name of this operation (optional).
  Returns:
    A pair of Tensors, each shaped [batch_size, num_steps]. The
    second element of the tuple is the same data time-shifted to the
    right by one.
  Raises:
    tf.errors.InvalidArgumentError: if batch_size or num_steps are
    too high.
  """
  with tf.name_scope(name, "PTBProducer",
                     [raw_data, batch_size, num_steps]):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data",
                                    dtype=tf.int32)
    data_len = tf.size(raw_data)
    batch_len = data_len // batch_size
    data = tf.reshape(raw_data[0 : batch_size * batch_len],
                      [batch_size, batch_len])
    epoch_size = (batch_len - 1) // num_steps
    assertion = tf.assert_positive(
        epoch_size,
        message="epoch_size == 0, decrease batch_size or num_steps")
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")
    i = tf.train.range_input_producer(epoch_size,
                                      shuffle=False).dequeue()
    x = tf.strided_slice(data, [0, i * num_steps],
                         [batch_size, (i + 1) * num_steps])
    x.set_shape([batch_size, num_steps])
    y = tf.strided_slice(data, [0, i * num_steps + 1],
                         [batch_size, (i + 1) * num_steps + 1])
    y.set_shape([batch_size, num_steps])
    return x, y
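
The slicing arithmetic in ptb_producer is easier to follow on a tiny example. The sketch below reproduces the same reshaping and slicing with plain Python lists in place of tensors and a generator in place of the queue. The function name and the toy data are hypothetical.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) minibatches as nested lists, mimicking ptb_producer."""
    data_len = len(raw_data)
    batch_len = data_len // batch_size
    # Equivalent of the reshape to [batch_size, batch_len]; leftover
    # elements at the end of raw_data are discarded.
    data = [raw_data[b * batch_len:(b + 1) * batch_len]
            for b in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        # x holds num_steps inputs; y holds the same window shifted by one,
        # so y[b][t] is the word that follows x[b][t].
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y

batches = list(ptb_batches(list(range(20)), batch_size=2, num_steps=3))
for x, y in batches:
    print(x, y)
```

With 20 tokens, a batch size of 2, and 3 unrolled steps, each row holds 10 tokens and one epoch yields 3 minibatches. The subtraction of 1 in epoch_size leaves room for the shifted targets.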

> **_tf.data_**<br>TensorFlow (from version 1.4 onward) supports a new module tf.data with a new class tf.data.Dataset that provides an explicit API for representing streams of data. It’s likely that tf.data will eventually supersede queues as the preferred input modality, especially since it has a well-thought-out functional API.<br>At the time of writing, the tf.data module was just released and remained relatively immature compared with other parts of the API, so we decided to stick with queues for the examples. However, we encourage you to learn about tf.data yourself.

# The Basic Recurrent Architecture

We will use an LSTM cell for modeling the Penn Treebank, since LSTMs often offer superior performance for language modeling challenges. The class tf.contrib.rnn.BasicLSTMCell already implements the basic LSTM cell, so there is no need to implement it ourselves (Example 7-6).

Example 7-6. This function wraps an LSTM cell from tf.contrib

def lstm_cell():
  return tf.contrib.rnn.BasicLSTMCell(
      size, forget_bias=0.0, state_is_tuple=True,
      reuse=tf.get_variable_scope().reuse)

> **_Is Using TensorFlow Contrib Code OK?_**<br>Note that the LSTM implementation we use is drawn from tf.contrib. Is it acceptable to use code from tf.contrib for industrial-strength projects? The jury still appears to be out on this one. From our personal experience, code in tf.contrib tends to be a bit shakier than code in the core TensorFlow library, but is usually still pretty solid. There are often many useful libraries and utilities that are only available as part of tf.contrib. Our recommendation is to use pieces from tf.contrib as necessary, but make note of the pieces you use and replace them if an equivalent in the core TensorFlow library becomes available.

The snippet in Example 7-7 instructs TensorFlow to learn a word embedding for each word in our vocabulary. The key function for us is tf.nn.embedding_lookup, which allows us to perform the correct tensorial lookup operation. Note that we need to manually define the embedding matrix as a TensorFlow variable.

Example 7-7. Learn a word embedding for each word in the vocabulary

with tf.device("/cpu:0"):
  embedding = tf.get_variable(
      "embedding", [vocab_size, size], dtype=tf.float32)
  inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

With our word vectors in hand, we need to apply the LSTM cell (created by the function lstm_cell) to each word vector in our sequence. To do this, we use a Python for-loop to construct the needed set of calls to cell(). There’s only one trick here: we need to make sure we reuse the same variables at each timestep, since the LSTM cell should perform the same operation at each timestep. Luckily, the method reuse_variables() for variable scopes allows us to do so without much effort. See Example 7-8.

Example 7-8. Apply LSTM cell to each word vector in input sequence

outputs = []
state = self._initial_state
with tf.variable_scope("RNN"):
  for time_step in range(num_steps):
    if time_step > 0: tf.get_variable_scope().reuse_variables()
    (cell_output, state) = cell(inputs[:, time_step, :], state)
    outputs.append(cell_output)
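
The point about reusing the same variables at every timestep can be shown without TensorFlow. The sketch below swaps the LSTM for a much simpler scalar tanh recurrent cell, purely for illustration. The parameter values are arbitrary. What matters is that a single set of parameters is created once and applied at every step, while only the state changes.

```python
import math

def make_cell(w_x, w_h, b):
    """Return a scalar tanh RNN cell that closes over one set of parameters."""
    def cell(x, state):
        new_state = math.tanh(w_x * x + w_h * state + b)
        return new_state, new_state  # (output, next state)
    return cell

# The parameters are created once, outside the loop over timesteps.
cell = make_cell(w_x=0.5, w_h=0.8, b=0.0)

inputs = [1.0, 0.0, -1.0]  # one toy input per timestep
state = 0.0                # initial state
outputs = []
for x in inputs:
    # The same cell, and hence the same weights, is applied at each step.
    output, state = cell(x, state)
    outputs.append(output)
print(outputs)
```

In the TensorFlow version, calling reuse_variables() plays the role of the closure here: it ensures later calls to the cell pick up the existing weights instead of creating new ones.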

All that remains now is to define the loss associated with the graph in order to train it. Conveniently, TensorFlow offers a loss for training language models in tf.contrib. We need only make a call to tf.contrib.seq2seq.sequence_loss (Example 7-9). Underneath the hood, this loss is a cross-entropy computed over the sequence, which is closely tied to perplexity: perplexity is the exponential of the average per-word cross-entropy.

Example 7-9. Add the sequence loss

# use the contrib sequence loss and average over the batches
loss = tf.contrib.seq2seq.sequence_loss(
   logits,
   input_.targets,
   tf.ones([batch_size, num_steps], dtype=tf.float32),
   average_across_timesteps=False,
   average_across_batch=True
)
# update the cost variables
self._cost = cost = tf.reduce_sum(loss)

> **_Perplexity_**<br>Perplexity is often used for language modeling challenges. It is the exponential of the average per-word cross-entropy, and it measures how close the learned distribution is to the true distribution of the data; lower values are better. Empirically, perplexity has proven useful for many language modeling challenges, and we make use of it here in that capacity (sequence_loss computes the cross-entropy over sequences, from which perplexity follows by exponentiation).
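
As a small worked example of the metric, the sketch below computes perplexity as the exponential of the average negative log-probability that a model assigns to the correct next words. The probabilities are invented for illustration, and the function names are hypothetical.

```python
import math

def cross_entropy(probs_of_true_words):
    """Average negative log-probability assigned to the correct words."""
    total = -sum(math.log(p) for p in probs_of_true_words)
    return total / len(probs_of_true_words)

def perplexity(probs_of_true_words):
    """Perplexity is the exponential of the average cross-entropy."""
    return math.exp(cross_entropy(probs_of_true_words))

# A model that guesses uniformly among 4 words at each of 8 positions.
uniform = [0.25] * 8
# A model that assigns probability 0.9 to the correct word each time.
confident = [0.9] * 8

print(perplexity(uniform))
print(perplexity(confident))
```

A uniform guess over 4 words gives a perplexity of about 4, so perplexity can be read as the effective number of equally likely choices the model is hesitating between. A perfect model would reach the minimum value of 1.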

We can then train this graph using a standard gradient descent method. We leave out some of the messy details of the underlying code, but suggest you check GitHub if curious. Evaluating the quality of the trained model turns out to be straightforward as well, since the evaluation metric, perplexity, is a monotonic function of the cross-entropy used as the training loss. As a result, we can simply track self._cost to gauge how the model is training: a lower cost means a lower perplexity. We encourage you to train the model for yourself!

# Challenge for the Reader

Try lowering perplexity on the Penn Treebank by experimenting with different model architectures. Note that these experiments might be time-consuming without a GPU.