# CIS600 - Social Media & Data Mining
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# Natural Language Processing, with NLTK

###  February 22, 2018

### Discovering network structure is just one aspect of social media mining. Let's look at the actual *content* users generate on social media, starting with data provided by the NLTK package.

### Running the next cell should bring up a window prompting you to select data for download. You should select "book" in order to get everything used in the [NLTK Book](http://www.nltk.org/book/).

In [3]:
# Importing NLTK; included in Conda distro
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## Reuters Corpus

### Let's start with `reuters`, a *corpus* taken from Reuters' reporting. This corpus is already grouped into categories and into *training* and *test* sets.

In [5]:
from nltk.corpus import reuters

### A *corpus* is a collection of documents. The documents in this case are articles, but in general they could be other things, such as individual tweets. Think "body" of work: *corpus* is Latin for "body".

### Here we have a list of all the documents' ids.

In [7]:
reuters_ids = reuters.fileids()

### The first and last documents...

In [8]:
print(reuters_ids[0],reuters_ids[-1])

('test/14826', 'training/9995')


### Notice that there are *test* and *training* documents. We will use that when we do classification, where the training documents will be used to 'learn' a model, and the test documents to evaluate the quality of that model.
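### As a minimal sketch of how that split can be recovered, the prefix of each id is enough. The short list below is a hand-picked sample of ids that appear in this notebook, standing in for the full `reuters_ids` list; the variable names are illustrative only.

```python
# Hand-picked sample of ids in the same 'split/number' form as reuters.fileids().
sample_ids = ['test/14826', 'test/15618', 'training/9865', 'training/9995']

# The split is encoded in the prefix of each id.
train_ids = [i for i in sample_ids if i.startswith('training/')]
test_ids = [i for i in sample_ids if i.startswith('test/')]

print(len(train_ids), len(test_ids))
```

### The same two list comprehensions should work unchanged on the full list of ids.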

### The `reuters` corpus is grouped into many overlapping categories...

In [9]:
reuters_cats = reuters.categories()
print(reuters_cats, len(reuters_cats))

([u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', u'cotton', u'cotton-oil', u'cpi', u'cpu', u'crude', u'dfl', u'dlr', u'dmk', u'earn', u'fuel', u'gas', u'gnp', u'gold', u'grain', u'groundnut', u'groundnut-oil', u'heat', u'hog', u'housing', u'income', u'instal-debt', u'interest', u'ipi', u'iron-steel', u'jet', u'jobs', u'l-cattle', u'lead', u'lei', u'lin-oil', u'livestock', u'lumber', u'meal-feed', u'money-fx', u'money-supply', u'naphtha', u'nat-gas', u'nickel', u'nkr', u'nzdlr', u'oat', u'oilseed', u'orange', u'palladium', u'palm-oil', u'palmkernel', u'pet-chem', u'platinum', u'potato', u'propane', u'rand', u'rape-oil', u'rapeseed', u'reserves', u'retail', u'rice', u'rubber', u'rye', u'ship', u'silver', u'sorghum', u'soy-meal', u'soy-oil', u'soybean', u'strategic-metal', u'sugar', u'sun-meal', u'sun-oil', u'sunseed', u'tea', u'tin', u'trade', u'veg-oil', u'wheat', u'wpi', u'yen', u'zinc']

### That was all of them, but the function `categories` can be applied to a particular document to get its categories.

In [10]:
reuters.categories('training/9865')

[u'barley', u'corn', u'grain', u'wheat']

### ...or a list of documents

In [13]:
reuters.categories(['training/9865','training/9880'])

[u'barley', u'corn', u'grain', u'money-fx', u'wheat']

### And you can pass a category or list of categories to the `fileids` function.

In [14]:
reuters.fileids('barley')

[u'test/15618',
 u'test/15649',
 u'test/15676',
 u'test/15728',
 u'test/15871',
 u'test/15875',
 u'test/15952',
 u'test/17767',
 u'test/17769',
 u'test/18024',
 u'test/18263',
 u'test/18908',
 u'test/19275',
 u'test/19668',
 u'training/10175',
 u'training/1067',
 u'training/11208',
 u'training/11316',
 u'training/11885',
 u'training/12428',
 u'training/13099',
 u'training/13744',
 u'training/13795',
 u'training/13852',
 u'training/13856',
 u'training/1652',
 u'training/1970',
 u'training/2044',
 u'training/2171',
 u'training/2172',
 u'training/2191',
 u'training/2217',
 u'training/2232',
 u'training/3132',
 u'training/3324',
 u'training/395',
 u'training/4280',
 u'training/4296',
 u'training/5',
 u'training/501',
 u'training/5467',
 u'training/5610',
 u'training/5640',
 u'training/6626',
 u'training/7205',
 u'training/7579',
 u'training/8213',
 u'training/8257',
 u'training/8759',
 u'training/9865',
 u'training/9958']

In [15]:
reuters.fileids(['barley','corn'])

[u'test/14832',
 u'test/14858',
 u'test/15033',
 u'test/15043',
 u'test/15106',
 u'test/15287',
 u'test/15341',
 u'test/15618',
 u'test/15648',
 u'test/15649',
 u'test/15676',
 u'test/15686',
 u'test/15720',
 u'test/15728',
 u'test/15845',
 u'test/15856',
 u'test/15860',
 u'test/15863',
 u'test/15871',
 u'test/15875',
 u'test/15877',
 u'test/15890',
 u'test/15904',
 u'test/15906',
 u'test/15910',
 u'test/15911',
 u'test/15917',
 u'test/15952',
 u'test/15999',
 u'test/16012',
 u'test/16071',
 u'test/16099',
 u'test/16147',
 u'test/16525',
 u'test/16624',
 u'test/16751',
 u'test/16765',
 u'test/17503',
 u'test/17509',
 u'test/17722',
 u'test/17767',
 u'test/17769',
 u'test/18024',
 u'test/18035',
 u'test/18263',
 u'test/18482',
 u'test/18614',
 u'test/18908',
 u'test/18954',
 u'test/18973',
 u'test/19165',
 u'test/19275',
 u'test/19668',
 u'test/19721',
 u'test/19821',
 u'test/20018',
 u'test/20366',
 u'test/20637',
 u'test/20645',
 u'test/20649',
 u'test/20723',
 u'test/20763',
 u'test/

### We can get the words appearing in a list of documents or in documents belonging to a specified category.

In [16]:
reuters.words('training/9865')[:14]

[u'FRENCH',
 u'FREE',
 u'MARKET',
 u'CEREAL',
 u'EXPORT',
 u'BIDS',
 u'DETAILED',
 u'French',
 u'operators',
 u'have',
 u'requested',
 u'licences',
 u'to',
 u'export']

In [17]:
reuters.words(['training/9865','training/9880'])

[u'FRENCH', u'FREE', u'MARKET', u'CEREAL', u'EXPORT', ...]

In [18]:
reuters.words(categories='barley')

[u'FRENCH', u'FREE', u'MARKET', u'CEREAL', u'EXPORT', ...]

In [19]:
reuters.words(categories=['barley','corn'])

[u'THAI', u'TRADE', u'DEFICIT', u'WIDENS', u'IN', ...]

### What about the actual content? The `raw` function returns it as a single string.

In [20]:
print(reuters.raw('test/14826'))

ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict would hurt
  them in the long-run, in the short-term Tokyo's loss might be
  their gain.
      The U.S. Has said it will impose 300 mln dlrs of tariffs on
  imports of Japanese electronics goods on April 17, in
  retaliation for Japan's alleged failure to stick to a pact not
  to sell semiconductors on world markets at below cost.
      Unofficial Japanese estimates put the impact of the tariffs
  at 10 billion dlrs and spokesmen for major electronics firms
  said they would virtually halt exports

### For more on available methods, see `help`.

In [21]:
help(nltk.corpus.reader)

Help on package nltk.corpus.reader in nltk.corpus:

NAME
    nltk.corpus.reader

FILE
    /anaconda3/envs/Social-Media-Mining/lib/python2.7/site-packages/nltk/corpus/reader/__init__.py

DESCRIPTION
    NLTK corpus readers.  The modules in this package provide functions
    that can be used to read corpus fileids in a variety of formats.  These
    functions can be used to read both the corpus fileids that are
    distributed in the NLTK corpus package, and corpus fileids that are part
    of external corpora.
    
    Corpus Reader Functions
    Each corpus module defines one or more "corpus reader functions",
    which can be used to read documents from that corpus.  These functions
    take an argument, ``item``, which is used to indicate which document
    should be read from the corpus:
    
    - If ``item`` is one of the unique identifiers listed in the corpus
      module's ``items`` variable, then the corresponding document will
      be loaded from the NLTK corpus package.
   

## Movie Reviews Corpus (from Bo Pang and Lillian Lee)

In [63]:
from nltk.corpus import movie_reviews

In [64]:
movie_ids = movie_reviews.fileids()

In [65]:
print(movie_ids[0],movie_ids[-1])

(u'neg/cv000_29416.txt', u'pos/cv999_13106.txt')


### These are split into *negative* and *positive* movie reviews - this is the sort of classification we would like to do for sentiment analysis.

In [66]:
movie_reviews.categories()

[u'neg', u'pos']

In [67]:
movie_reviews.categories('neg/cv000_29416.txt')

[u'neg']

In [68]:
print(len(movie_reviews.fileids('neg')), len(movie_reviews.fileids('pos')))

(1000, 1000)


### Again, we can look at the raw content of the document.

In [69]:
movie_reviews.raw('neg/cv000_29416.txt')

u'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience memb

## Twython

### The NLTK Twitter module depends on the Twython package.

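### This is another Python package for interacting with the Twitter API. Maybe you'll prefer it. The first three examples below use the `Twitter` convenience class to read from the public stream; as far as I can tell it loads credentials from file when constructed, so set those up as described further down if it complains.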

### See the [NLTK Twitter HOWTO](http://www.nltk.org/howto/twitter.html) for more details.

In [74]:
from nltk.twitter import Twitter
tw = Twitter()
tw.tweets(keywords='love, hate', limit=10) #sample from the public stream

ImportError: cannot import name Twitter

In [75]:
tw = Twitter()
tw.tweets(follow=['759251', '612473'], limit=10) # see what CNN and BBC are talking about

NameError: name 'Twitter' is not defined

In [76]:
tw = Twitter()
tw.tweets(to_screen=False, limit=25)

NameError: name 'Twitter' is not defined

### Let's use credentials. They must be stored in a file with the name "credentials.txt" kept in your *twitter-files* directory. The file must have the following format:

```
app_key=YOUR_CONSUMER_KEY  
app_secret=YOUR_CONSUMER_SECRET  
oauth_token=YOUR_ACCESS_TOKEN  
oauth_token_secret=YOUR_ACCESS_TOKEN_SECRET
```

In [7]:
from nltk.twitter import Query, Streamer, Twitter, TweetViewer, TweetWriter, credsfromfile

In [None]:
oauth = credsfromfile()
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.sample()

In [None]:
client = Streamer(**oauth)
client.register(TweetViewer(limit=10))
client.filter(track='refugee, germany')

In [None]:
client = Query(**oauth)
tweets = client.search_tweets(keywords='nltk', limit=10)
tweet = next(tweets)
from pprint import pprint
pprint(tweet, depth=1)

### (What is `next`?)

In [None]:
help(next)
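### In short, `next` pulls one item from an iterator and advances it. The sketch below uses a hypothetical generator, `numbered_tweets`, as a stand-in for the search results (assuming `tweets` behaves like a generator) to show that whatever `next` has consumed is not seen again by a later loop.

```python
def numbered_tweets(n):
    """Hypothetical stand-in for a stream of tweets: yields n dicts lazily."""
    for i in range(n):
        yield {'text': 'tweet number {}'.format(i)}

stream = numbered_tweets(3)

# next() consumes the first item and advances the generator.
first = next(stream)

# Iterating afterwards only sees what is left.
rest = [tw['text'] for tw in stream]

print(first['text'])
print(rest)
```

### So a loop over `tweets` after a call to `next` would start from the second result.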

### Printing those tweets

In [None]:
for tweet in tweets:
    print(tweet['text'])

### Now for some initial processing steps. Ultimately, you'll want a mathematical representation of tweets, reviews, posts - whatever you are trying to classify.

In [44]:
from nltk import *

## Tokenization

In [45]:
s = 'We bought apples, oranges, etc., etc.'
tokens = tokenize.word_tokenize(s)
print(tokens)

['We', 'bought', 'apples', ',', 'oranges', ',', 'etc.', ',', 'etc', '.']


### It may seem trivial because all we are doing is breaking the sentence down into words. But which things count? Notice that the commas appear in our list of tokens as well. With special characters thrown into the mix, as in a tweet, things become even more complicated.

In [46]:
t = '''#qcpoli enjoyed a hearty laugh today with #plq
    debate audience for @jflisee #notrehome tune was that the intended reaction?'''

In [47]:
tt = TweetTokenizer()

In [48]:
tokens2 = tt.tokenize(t)

In [49]:
print(tokens2)

[u'#qcpoli', u'enjoyed', u'a', u'hearty', u'laugh', u'today', u'with', u'#plq', u'debate', u'audience', u'for', u'@jflisee', u'#notrehome', u'tune', u'was', u'that', u'the', u'intended', u'reaction', u'?']


### These results are different from what you'd get from the general-purpose `word_tokenize`, which splits `#` and `@` off as separate tokens:

In [50]:
tokens3 = tokenize.word_tokenize(t)
print(tokens3)

['#', 'qcpoli', 'enjoyed', 'a', 'hearty', 'laugh', 'today', 'with', '#', 'plq', 'debate', 'audience', 'for', '@', 'jflisee', '#', 'notrehome', 'tune', 'was', 'that', 'the', 'intended', 'reaction', '?']
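### The difference comes down to which character sequences are treated as a single unit. Here is a rough sketch of the idea using one regular expression; this is an illustration only, not how `TweetTokenizer` is actually implemented, and the names `TOKEN_PATTERN` and `simple_tweet_tokenize` are made up for this example.

```python
import re

# Either an optional '#' or '@' glued to a run of word characters,
# or else a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"[#@]?\w+|[^\w\s]")

def simple_tweet_tokenize(text):
    """Split text into tokens, keeping hashtags and mentions in one piece."""
    return TOKEN_PATTERN.findall(text)

example = "#qcpoli enjoyed a hearty laugh today with #plq debate audience for @jflisee"
print(simple_tweet_tokenize(example))
```

### A real tweet tokenizer also has to deal with URLs, emoticons and the like, which this pattern ignores.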


### Tokenization is just the fundamental first step toward a model. Whether you use N-grams, Word2Vec, Bag-of-Words, Naïve Bayes, or whatever, you will almost certainly start with tokenization, because we need to chop things up into pieces before we can understand them.

### (We will look at each of those; don't worry if they sound alien.)

## Stemming/Lemmatization

### Many words are subtle variants of each other or of another more basic word. Examples:

- likes $\to$ like
- carries $\to$ carry
- books $\to$ book

### A natural next step after tokenization, particularly if you are taking frequency of words into account, is to identify root words whose variations occur as different tokens. For instance, if you are searching documents containing "democracy", you probably want results including documents containing "democratic" as well.

### Technically, *stemming* strips prefixes/suffixes by rule, so the result need not be a real word (*apples* $\to$ *appl*), while *lemmatization* reduces a word to its dictionary form (its *lemma*), so the result is always a legitimate word.

### This is a non-trivial task (lemmatization), and is based on *rules* and *dictionaries*. In other words, sometimes you can just do stemming (*stemmed* $\to$ *stem*), but other cases require a lookup (*sought* $\to$ *seek*).
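### To make the rules-versus-lookup distinction concrete, here is a toy sketch. The suffix list and the one-entry dictionary are invented for illustration; the real Porter stemmer and WordNet lemmatizer are far more careful than this.

```python
# Toy rules and a toy dictionary, invented for this example.
SUFFIXES = ('ing', 'ed', 's')
IRREGULAR = {'sought': 'seek'}

def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling, e.g. 'stemm' -> 'stem'.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in 'aeiou':
                stem = stem[:-1]
            return stem
    return word

def toy_lemmatize(word):
    """Look the word up first; fall back to the rule-based stem."""
    return IRREGULAR.get(word, toy_stem(word))

for w in ['stemmed', 'books', 'likes', 'sought']:
    print(w, toy_stem(w), toy_lemmatize(w))
```

### Rules alone get *stemmed* and *books* right but leave *sought* untouched; only the lookup recovers *seek*.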

### Furthermore, lemmatization cannot be done one token at a time, since parts of speech (POS) must be considered. Example:

- bored/bore/bear

### Stemming in NLTK

In [51]:
tokens = word_tokenize(s)
porter = PorterStemmer()
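# Loop variable is `tok`, not `t`: in Python 2 a list-comprehension variable leaks
# into the enclosing scope and would overwrite the tweet string `t` defined above.
stems = [porter.stem(tok) for tok in tokens]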
print(stems)

['We', 'bought', u'appl', ',', u'orang', ',', 'etc.', ',', 'etc', '.']


### (named for [Martin Porter](https://tartarus.org/martin/index.html).)

### Stemming with the Lancaster stemmer (named for Lancaster University).

In [52]:
lancaster = LancasterStemmer()
stems = [lancaster.stem(tok) for tok in tokens]
print(stems)

['we', 'bought', 'appl', ',', 'orang', ',', 'etc.', ',', 'etc', '.']


### Lemmatization in NLTK

In [53]:
wnl = WordNetLemmatizer()
print([wnl.lemmatize(tok) for tok in tokens])

['We', 'bought', u'apple', ',', u'orange', ',', 'etc.', ',', 'etc', '.']


## Stopwords

In [54]:
from nltk.corpus import stopwords
import string

In [55]:
stop = stopwords.words('english')
print(stop)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're", u"you've", u"you'll", u"you'd", u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'eac

In [56]:
tokens_filtered = [w for w in tokens if w.lower() not in stop and w not in string.punctuation]
print(tokens_filtered)

['bought', 'apples', 'oranges', 'etc.', 'etc']


### Remark: which languages are supported in `stopwords`? And what is included in `string.punctuation`?

In [57]:
print(stopwords.fileids())

[u'arabic', u'azerbaijani', u'danish', u'dutch', u'english', u'finnish', u'french', u'german', u'greek', u'hungarian', u'italian', u'kazakh', u'nepali', u'norwegian', u'portuguese', u'romanian', u'russian', u'spanish', u'swedish', u'turkish']


In [58]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


## Frequency

### The most frequent words are often stopwords and can be deleted (depending on the application). Very rare words are often typos to be dismissed (or counted as an occurrence of another word). Surprisingly short dictionaries (200 words) suffice for many applications.

In [59]:
tokens = tokenize.word_tokenize(reuters.raw('test/14826'))
fdist = FreqDist(tokens)
print(fdist.most_common(100))

[(u'the', 32), (u'.', 31), (u'of', 30), (u',', 29), (u'to', 26), (u'said', 16), (u'a', 14), (u'trade', 13), (u'U.S.', 13), (u'in', 13), (u'and', 12), (u'Japan', 12), (u"'s", 12), (u'``', 7), (u'for', 7), (u"''", 7), (u'on', 6), (u'dlrs', 6), (u'imports', 5), (u'The', 5), (u'exports', 5), (u'are', 5), (u'is', 5), (u'it', 5), (u'be', 5), (u'tariffs', 5), (u'billion', 5), (u'And', 4), (u'would', 4), (u'U.S', 4), (u'Japanese', 4), (u'Hong', 4), (u'might', 4), (u'year', 4), (u'electronics', 4), (u'businessmen', 4), (u'that', 4), (u'also', 4), (u'Kong', 4), (u'between', 4), (u'Taiwan', 4), (u'Industry', 3), (u'will', 3), (u'Korea', 3), (u'semiconductors', 3), (u'&', 3), (u'lt', 3), (u'their', 3), (u'surplus', 3), (u'We', 3), (u';', 3), (u'against', 3), (u'>', 3), (u'In', 3), (u'short-term', 3), (u'with', 3), (u'at', 3), (u'South', 3), (u'very', 2), (u'exporters', 2), (u'boost', 2), (u'products', 2), (u'Australia', 2), (u'from', 2), (u'two', 2), (u'markets', 2), (u'pct', 2), (u'not', 2), (u'e

### Some basic summary stats...

In [60]:
print("Total number of tokens = {}".format(fdist.N()))
print("Total number of unique tokens = {}".format(len(fdist.keys())))

Total number of tokens = 816
Total number of unique tokens = 387
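### One way to get a feel for how far a short dictionary goes is to ask what fraction of all tokens the top *k* types account for. Here is a small sketch using `collections.Counter`; the sample sentence is made up, and `coverage` is a name chosen for this example.

```python
from collections import Counter

def coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / float(len(tokens))

# Made-up sample text, split on whitespace for simplicity.
sample = "the trade row between the U.S. and Japan could hurt the exports of Asia".split()

print(coverage(sample, 1))
print(coverage(sample, 5))
```

### The same function can be applied to the `tokens` list built above to see how quickly coverage grows with *k*.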


In [61]:
for token in fdist:
    print("Term " + token + " occurs " + str(fdist[token]) + " times.")

Term restraining occurs 1 times.
Term talks occurs 1 times.
Term long-run occurs 1 times.
Term row occurs 1 times.
Term whose occurs 1 times.
Term effort occurs 1 times.
Term to occurs 26 times.
Term program occurs 1 times.
Term include occurs 1 times.
Term defuse occurs 1 times.
Term And occurs 4 times.
Term advantage occurs 1 times.
Term very occurs 2 times.
Term Industry occurs 3 times.
Term exporters occurs 2 times.
Term Unofficial occurs 1 times.
Term domestically occurs 1 times.
Term Paul occurs 1 times.
Term prevent occurs 1 times.
Term below-cost occurs 1 times.
Term estimates occurs 1 times.
Term cost occurs 1 times.
Term further occurs 1 times.
Term will occurs 3 times.
Term Korea occurs 3 times.
Term spokesman occurs 1 times.
Term new occurs 1 times.
Term firms occurs 1 times.
Term boost occurs 2 times.
Term public occurs 1 times.
Term told occurs 1 times.
Term exchange occurs 1 times.
Term commercial occurs 1 times.
Term Exports occurs 1 times.
Term Up occurs 1 times.
Term 

### We can visualize the distribution of frequency, too.

In [62]:
fdist.plot()

ValueError: The plot function requires matplotlib to be installed.See http://matplotlib.org/

In [None]:
fdist.plot(cumulative=True)

## Text Normalization

### We get text content from many different sources and we want a unified format. Issues of grammaticality, spelling, punctuation, acronyms, weird tokens (e.g. emoticons) and others make this hard. There is no single nifty Python package that handles it all for us, but here is an example of one tool used in normalizing text - *edit distance*, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another:

In [None]:
from nltk.metrics import *
edit_distance('rain', 'shine')
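### For the curious, here is a sketch of how such a distance can be computed with dynamic programming. This is the textbook Levenshtein recurrence written from scratch, not NLTK's own code, and `levenshtein` is a name chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(levenshtein('rain', 'shine'))  # 3
```

### In normalization, a small distance to a known word is a hint that a rare token is a misspelling of it.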

### See also the [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). There is also [phonemic distance](https://en.wikipedia.org/wiki/Phonetic_algorithm), based on the pronunciation of words.

### You might as well download [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.207.6218&rep=rep1&type=pdf) because I will assign it as reading eventually.

## Representation

### So far, we have looked at basic tools for breaking down text and cleaning it up. In order to plug it into a machine learning algorithm, text will need to be broken down, cleaned up and then encoded in vectors.

- Word2Vec - Skip-Gram/CBOW
- Bag-of-Words
- N-grams

### Using `nltk` to calculate N-grams is a natural next step after tokenization and normalization, for sentiment analysis on tweets, say.

In [73]:
tokens = tt.tokenize(t)

In [None]:
for b in bigrams(tokens):
    print(b)

In [None]:
for r in trigrams(tokens):
    print(r)

In [None]:
for n in ngrams(tokens,4):
    print(n)
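### There is nothing mysterious going on in these functions: an N-gram is just a run of N consecutive tokens. A pure-Python sketch of the same idea follows (the helper name `ngrams_sketch` is made up, and it returns a list rather than a lazy iterator).

```python
def ngrams_sketch(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Zip the token list against copies of itself shifted by 1, 2, ..., n-1.
    return list(zip(*[tokens[i:] for i in range(n)]))

tokens_demo = ['#qcpoli', 'enjoyed', 'a', 'hearty', 'laugh']

print(ngrams_sketch(tokens_demo, 2))
print(ngrams_sketch(tokens_demo, 4))
```

### A list of *m* tokens yields *m - N + 1* N-grams, or none at all if it is shorter than N.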

### These can then be transformed into numerical vectors using, say, a *one-hot* encoding.
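### As a minimal sketch of what that encoding looks like: build a vocabulary mapping each distinct item to a column, then represent an item as a vector with a single 1. Summing the one-hot vectors of a document gives its bag-of-words counts. All function names here are illustrative, and the items could just as well be bigram tuples as single tokens.

```python
def build_vocab(items):
    """Map each distinct item to a column index, in order of first appearance."""
    vocab = {}
    for item in items:
        if item not in vocab:
            vocab[item] = len(vocab)
    return vocab

def one_hot(item, vocab):
    """Vector of zeros with a single 1 in the item's column."""
    vec = [0] * len(vocab)
    vec[vocab[item]] = 1
    return vec

def bag_of_words(items, vocab):
    """Count vector: the sum of the one-hot vectors of all items."""
    vec = [0] * len(vocab)
    for item in items:
        vec[vocab[item]] += 1
    return vec

tokens_demo = ['we', 'bought', 'apples', 'we', 'bought', 'oranges']
vocab = build_vocab(tokens_demo)

print(one_hot('apples', vocab))
print(bag_of_words(tokens_demo, vocab))
```

### With a realistic vocabulary these vectors are very long and mostly zeros, so sparse representations are normally used in practice.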

## Classification

### After we have tokenized and cleaned and encoded, what then? Then we want to do classification. We want to learn from the data. We will look at three different methods for this, at least:

- decision trees
- support vector machines
- naïve Bayes

### You can also use neural nets and any other thing that can be made to operate on the representation.
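### To give a flavor of what "learning from the data" means for text, here is a toy naïve Bayes classifier written from scratch with add-one smoothing. The four training snippets are invented (loosely borrowing words from the review shown earlier), the function names are made up, and a real experiment would use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs is a list of (tokens, label) pairs. Returns the counts needed to predict."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, token_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label in sorted(label_counts):
        # Log prior plus the sum of smoothed log likelihoods of known tokens.
        score = math.log(label_counts[label] / float(total_docs))
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:
                score += math.log((token_counts[label][tok] + 1) / float(denom))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data.
train = [
    ('a very cool idea'.split(), 'pos'),
    ('pretty neat concept'.split(), 'pos'),
    ('executed it terribly'.split(), 'neg'),
    ('simply too jumbled'.split(), 'neg'),
]
model = train_nb(train)

print(predict_nb(model, 'neat idea'.split()))
print(predict_nb(model, 'too jumbled'.split()))
```

### Tokens never seen in training are simply skipped here, which is one of several possible design choices.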

In [None]:
# Example taken from Sklearn docs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets


def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out


# import some data to play with
iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
models = (svm.SVC(kernel='linear', C=C),
          svm.LinearSVC(C=C),
          svm.SVC(kernel='rbf', gamma=0.7, C=C),
          svm.SVC(kernel='poly', degree=3, C=C))
models = (clf.fit(X, y) for clf in models)

# title for the plots
titles = ('SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel')

# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2)
plt.subplots_adjust(wspace=0.4, hspace=0.4)

X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)

for clf, title, ax in zip(models, titles, sub.flatten()):
    plot_contours(ax, clf, xx, yy,
                  cmap=plt.cm.coolwarm, alpha=0.8)
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)

plt.show()

### What is this "kernel" business? Let's look at an example to illustrate the basic idea.