In [1]:
import math
import re

In [2]:
class Corpora:
    """
    Helper class containing simple example corpora and other reference material for use with the VectorSemanticsModel class.

    Attributes
    N/A
    
    Methods
    saturn(): str
        an example corpus, adapted from Simple English Wikipedia's article about Saturn: https://simple.wikipedia.org/wiki/Saturn
    music(): str
        an example corpus, adapted from Simple English Wikipedia's article on music: https://simple.wikipedia.org/wiki/Music
    functionWords(): list
        a list of function words that the VectorSemanticsModel class is designed to exclude from its vocabulary by default
    """
    
    @staticmethod
    def saturn():
        """
        "Saturn" example corpus
        
        Parameters
        N/A
            
        Returns
        saturn (str):
            an example corpus, adapted from Simple English Wikipedia's article about Saturn: https://simple.wikipedia.org/wiki/Saturn
        """
        return """Saturn is the sixth planet from the Sun in the Solar System. It is the second largest planet in the Solar System, after Jupiter. Saturn is one of the four gas giant planets, along with Jupiter, Uranus, and Neptune. Inside Saturn is probably a core of iron, nickel, silicon and oxygen compounds, surrounded by a deep layer of metallic hydrogen, then a layer of liquid hydrogen and liquid helium and finally, an outer gaseous layer. Saturn has 67 known moons orbiting the planet. 38 are officially named and 29 are waiting to be named. The largest moon is Titan, which is larger in volume than the planet Mercury. Titan is the second-largest moon in the Solar System. The largest moon is Jupiter's moon, Ganymede. There is also a very large system of rings around Saturn. These rings are made of ice with smaller amounts of rocks and dust. Some people believe that the rings were caused from a moon impact or other event. Saturn is about 1,433,000,000 km (869,000,000 mi) on average from the Sun. Saturn takes 29.6 Earth years to revolve around the Sun. Saturn was named after the Roman god Saturnus (called Kronos in Greek mythology). Saturn's symbol is ♄ which is the symbol of Saturnus' sickle. Saturn is an oblate spheroid, meaning that it is flattened at the poles, and it swells out around its equator. The planet's equatorial diameter is 120,536 km (74,898 mi), while its polar diameter (the distance from the north pole to the south pole) is 108,728 km (67,560 mi); a 9% difference. Saturn has a flattened shape due to its very fast rotation, once every 10.8 hours. Saturn is the only planet in the Solar System that is less dense than water. Even though the planet's core is very dense, it has a gaseous atmosphere, so the average specific density of the planet is 0.69 g/cm3. This means if Saturn could be placed in a large pool of water, it would float. 
Atmosphere The outer part of Saturn's atmosphere is made up of about 96% hydrogen, 3% helium, 0.4% methane and 0.01% ammonia. There are also very small amounts of acetylene, ethane and phosphine. The hexagonal cloud The north polar hexagonal cloud first found by Voyager 1 and later by Cassini Saturn's clouds show a banded pattern, like the cloud bands seen on Jupiter. Saturn's clouds are much fainter and the bands are wider at the equator. Saturn's lowest cloud layer is made up of water ice, and is about 10 km (6 mi) thick. The temperature here is quite low, at 250 K (-10°F, -23°C). However scientists do not agree about this. The layer above, about 77 km (48 mi) thick, is made up of ammonium hydrosulfide ice, and above that is a layer of ammonia ice clouds 80 km (50 mi) thick. The highest layer is made up of hydrogen and helium gases, which extends between 200 km (124 mi) and 270 km (168 mi) above the water cloud tops. Auroras are also known to form in Saturn in the mesosphere. The temperature at Saturn's cloud tops is extremely low, at 98 K (-283 °F, -175 °C). The temperatures in the inner layers are much higher than the outside layers because of the heat produced by Saturn's interior. Saturn's winds are some of the fastest in the Solar System, reaching 1,800 km/h (1,118 mph), ten times faster than winds on Earth. Storms and spots Saturn's atmosphere is also known to form oval shaped clouds, similar to the clearer spots seen in Jupiter. These oval spots are cyclonic storms, the same as cyclones seen on Earth. In 1990, the Hubble Space Telescope found a very large white cloud near Saturn's equator. Storms like the one in 1990 were known as Great White Spots. These unique storms only exist for a short time and only happen about every 30 Earth years, at the time of the summer solstice in the Northern Hemisphere. Great White Spots were also found in 1876, 1903, 1933, and 1960. If this cycle continues, another storm will form in about 2020. 
The Voyager 1 spacecraft found a hexagonal cloud pattern near Saturn's north pole at about 78°N. The Cassini−Huygens probe later confirmed it in 2006. Unlike the north pole, the south pole does not show any hexagonal cloud feature. The probe also discovered a hurricane-like storm locked to the south pole that clearly showed an eyewall. Until this discovery, eyewalls had only been seen on Earth. Interior Saturn's interior is similar to Jupiter's interior. It has a small rocky core about the size of the Earth at its center. It is very hot; its temperature reaches 15,000 K (26,540 °F (14,727 °C)). Saturn is so hot that it gives out more heat energy into space than it receives from the Sun. Above it is a thicker layer of metallic hydrogen, about 30,000 km (18,641 mi) deep. Above that layer is a region of liquid hydrogen and helium. The core is heavy, with about 9 to 22 times more mass than the Earth's core. Magnetic field Saturn has a natural magnetic field that is weaker than Jupiter's. Like the Earth's, Saturn's field is a magnetic dipole. Saturn's field is unique in that it is perfectly symmetrical, unlike any other known planet. This means the field is exactly in line with the planet's axis. Saturn generates radio waves, but they are too weak to be detected from Earth. The moon Titan orbits in the outer part of Saturn's magnetic field and gives out plasma to the field from the ionised particles in Titan's atmosphere. Rotation and orbit Saturn's average distance from the Sun is over 1,400,000,000 km (869,000,000 mi), about nine times the distance from the Earth to the Sun. It takes 10,759 days, or about 29.8 years, for Saturn to orbit around the Sun. This is known as Saturn's orbital period. Voyager 1 measured Saturn's rotation as being 10 hours 14 minutes at the equator, 10 hours 40 minutes closer to the poles, and 10 hours 39 minutes 24 seconds for the planet's interior. This is known as its rotational period. 
Cassini measured the rotation of Saturn as being 10 hours 45 minutes 45 seconds ± 36 seconds. That is about six minutes, or one percent, longer than the radio rotational period measured by the Voyager 1 and Voyager 2 spacecraft, which flew by Saturn in 1980 and 1981. Saturn's rotational period is calculated by the rotation speed of radio waves released by the planet. The Cassini−Huygens spacecraft discovered that the radio waves slowed down, suggesting that the rotational period increased. Since the scientists do not think Saturn's rotation is actually slowing down, the explanation may lie in the magnetic field that causes the radio waves. Planetary rings Main article: Rings of Saturn Saturn is best known for its planetary rings which are easy to see with a telescope. There are seven named rings; A, B, C, D, E, F, and G rings. They were named in the order they were discovered, which is different to their order from the planet. From the planet the rings are: D, C, B, A, F, G and E.:57 Scientists believe that the rings are material left after a moon broke apart.:60 A new idea says that it was a very large moon, most of which crashed into the planet. This left a large amount of ice to form the rings, and also some of the moons, like Enceladus, which are thought to be made of ice.:61 History The rings were first discovered by Galileo Galilei in 1610, using his telescope. They did not look like rings to Galileo, so he called them "handles". He thought that Saturn was three separate planets that almost touched one another. In 1612, when the rings were facing edge on with the Earth, the rings disappeared, then reappeared again in 1613, further confusing Galileo. In 1655, Christiaan Huygens was the first person to recognise Saturn was surrounded by rings. Using a much more powerful telescope than Galilei's, he noted Saturn "is surrounded by a thin, flat, ring, nowhere touching...". 
In 1675, Giovanni Domenico Cassini discovered that the planet's rings were in fact made of smaller ringlets with gaps. The largest ring gap was later named the Cassini Division. In 1859, James Clerk Maxwell showed that the rings cannot be solid, but are made of small particles, each orbiting Saturn on their own, otherwise, it would become unstable or break apart. James Keeler studied the rings using a spectroscope in 1895 which proved Maxwell's theory. Physical features The rings range from 6,630 km (4,120 mi) to 120,700 km (75,000 mi) above the planet's equator. As proved by Maxwell, even though the rings appear to be solid and unbroken when viewed from above, the rings are made of small particles of rock and ice. They are only about 10 m (33 ft) thick; made of silica rock, iron oxide and ice particles.:55 The smallest particles are only specks of dust while the largest are the size of a house. The C and D rings also seem to have a "wave" in them, like waves in water.:58 These large waves are 500 m (1,640 ft) high, but only moving slowly at about 250 m (820 ft) each day.:58 Some scientists believe that the wave is caused by Saturn's moons. Another idea is the waves were made by a comet hitting Saturn in 1983 or 1984.:60 The largest gaps in the rings are the Cassini Division and the Encke Division, both visible from the Earth. The Cassini Division is the largest, measuring 4,800 km (2,983 mi) wide. However, when the Voyager spacecrafts visited Saturn in 1980, they discovered that the rings are a complex structure, made out of thousands of thin gaps and ringlets. Scientists believe this is caused by the gravitational force of some of Saturn's moons. The tiny moon Pan orbits inside Saturn's rings, creating a gap within the rings. Other ringlets keep their structure due to the gravitational force of shepherd satellites, such as Prometheus and Pandora. Other gaps form due to the gravitational force of a large moon farther away. 
The moon Mimas is responsible for clearing away the Cassini gap. Recent data from the Cassini spacecraft has shown that the rings have their own atmosphere, free from the planet's atmosphere. The rings' atmosphere is made of oxygen gas, and it is produced when the Sun's ultraviolet light breaks up the water ice in the rings. Chemical reaction also occurs between the ultraviolet light and the water molecules, creating hydrogen gas. The oxygen and hydrogen atmospheres around the rings are very widely spaced. As well as oxygen and hydrogen gas, the rings have a thin atmosphere made of hydroxide. This anion was discovered by the Hubble Space Telescope. Spokes The spokes in Saturn's rings The spokes in Saturn's rings, photographed by Voyager 2 The Voyager space probe discovered features shaped like rays, called spokes. These were also seen later by the Hubble telescope. The Cassini probe photographed the spokes in 2005. They are seen as dark when under sunlight, and appear light when against the unlit side. At first it was thought the spokes were made of microscopic dust particles but new evidence shows that they are made of ice. They rotate at the same time with the planet's magnetosphere, therefore, it is believed that they have a connection with electromagnetism. However, what causes the spokes to form is still unknown. They appear to be seasonal, disappearing during solstice and appearing again during equinox. Moons Saturn has 53 named moons, and another nine which are still being studied. Many of the moons are very small: 33 are less than 10 km (6 mi) in diameter and 13 moons are less than 50 km (31 mi). Seven moons are large enough to be a near perfect sphere caused by their own gravitation. These moons are Titan, Rhea, Iapetus, Dione, Tethys, Enceladus and Mimas. Titan is the largest moon, larger than the planet Mercury, and it is the only moon in the Solar System to have a thick, dense atmosphere. 
Hyperion and Phoebe are the next largest moons, larger than 200 km (124 mi) in diameter. In December 2004 and January 2005 a man-made satellite called the Cassini−Huygens probe took lots of close photos of Titan. One part of this satellite, known as the Huygens probe, then landed on Titan. Named after the Dutch astronomer Christiaan Huygens, it was the first spacecraft to land in the outer Solar System. The probe was designed to float in case it landed in liquid. Enceladus, the sixth largest moon, is about 500 km (311 mi) in diameter. It is one of the few outer solar system objects that shows volcanic activity. In 2011, scientists discovered an electric link between Saturn and Enceladus. This is caused by ionised particles from volcanos on the small moon interacting with Saturn's magnetic fields. Similar interactions cause the northern lights on Earth. Exploration Saturn from Cassini orbiter Saturn as seen from the Cassini spacecraft in 2007 Saturn was first explored by the Pioneer 11 spacecraft in September 1979. It flew as close as 20,000 km (12,427 mi) above the planet's cloud tops. It took photographs of the planet and a few of its moons, but were low in resolution. It discovered a new, thin ring called the F ring. It also discovered that the dark ring gaps appear bright when viewed towards the Sun, which shows the gaps are not empty of material. The spacecraft measured the temperature of the moon Titan. In November 1980, Voyager 1 visited Saturn, and took higher resolution photographs of the planet, rings and moons. These photos were able to show the surface features of the moons. Voyager 1 went close to Titan, and gained much information about its atmosphere. In August, 1981, Voyager 2 continued to study the planet. Photos taken by the space probe showed that changes were happening to the rings and atmosphere. The Voyager spacecrafts discovered a number of moons orbiting close to Saturn's rings, as well as discovering new ring gaps. 
Drawing of Cassini in orbit around Saturn On July 1, 2004, the Cassini−Huygens probe entered into orbit around Saturn. Before then, it flew close to Phoebe, taking very high resolution photos of its surface and collecting data. On December 25, 2004, the Huygens probe separated from the Cassini probe before moving down towards Titan's surface and landed there on January 14, 2005. It landed on a dry surface, but it found that large bodies of liquid exist on the moon. The Cassini probe continued to collect data from Titan and a number of the icy moons. It found evidence that the moon Enceladus had water erupting from its geysers. Cassini also proved, in July 2006, that Titan had hydrocarbon lakes, located near its north pole. In March 2007, it discovered a large hydrocarbon lake the size of the Caspian Sea near its north pole. Cassini observed lightning occurring in Saturn since early 2005. The power of the lightning was measured to be 1,000 times more powerful than lightning on Earth. Astronomers believe that the lightning observed in Saturn is the strongest ever seen."""
    
    @staticmethod
    def music():
        """
        "Music" example corpus
        
        Parameters
        N/A
            
        Returns
        music (str):
            an example corpus, adapted from Simple English Wikipedia's article on music: https://simple.wikipedia.org/wiki/Music
        """
        return """Horn Music Music is a form of art that uses sound organised in time. Music is also a form of entertainment that puts sounds together in a way that people like, find interesting or dance to. Most music includes people singing with their voices or playing musical instruments, such as the piano, guitar, drums or violin. The word music comes from the Greek word (mousike), which means  (art) of the Muses . In Ancient Greece the Muses included the goddesses of music, poetry, art, and dance. Someone who makes music is known as a musician. Definition of music Music is sound that has been organized by using rhythm, melody or harmony. If someone bangs saucepans while cooking, it makes noise. If a person bangs saucepans or pots in a rhythmic way, they are making a simple type of music. There are four things which music has most of the time: Music often has pitch. This means high and low notes. Tunes are made of notes that go up or down or stay on the same pitch. Music often has rhythm. Rhythm is the way the musical sounds and silences are put together in a sequence. Every tune has a rhythm that can be tapped. Music usually has a regular beat. Music often has dynamics. This means whether it is quiet or loud or somewhere in between. Music often has timbre. This is a French word (pronounced the French way:  TAM-br ). The  timbre  of a sound is the way that a sound is interesting. The sort of sound might be harsh, gentle, dry, warm, or something else. Timbre is what makes a clarinet sound different from an oboe, and what makes one person's voice sound different from another person. Definitions There is no simple definition of music which covers all cases. It is an art form, and opinions come into play. Music is whatever people think is music. A different approach is to list the qualities music must have, such as, sound which has rhythm, melody, pitch, timbre, etc. 
These and other attempts, do not capture all aspects of music, or leave out examples which definitely are music. According to Thomas Clifton, music is  a certain reciprocal relation established between a person, his behavior, and a sounding object .p10 Musical experience and the music, together, are called phenomena, and the activity of describing phenomena is called phenomenology. History Musicians of Amun, Tomb of Nakht, 18th Dynasty, Western Thebes Even in the stone age people made music. The first music was probably made trying to imitate sounds and rhythms that occurred naturally. Human music may echo these phenomena using patterns, repetition and tonality. This kind of music is still here today. Shamans sometimes imitate sounds that are heard in nature. It may also serve as entertainment (games), or have practical uses, like attracting animals when hunting. Some animals also can use music. Songbirds use song to protect their territory, or to attract a mate. Monkeys have been seen beating hollow logs. This may, of course, also serve to defend the territory. The first musical instrument used by humans was probably the voice. The human voice can make many different kinds of sounds. The larynx (voice box) is like a wind instrument. The oldest known Neanderthal hyoid bone with the modern human form was found in 1983, indicating that the Neanderthals had language, because the hyoid supports the voice box in the human throat. Most likely the first rhythm instruments or percussion instruments involved the clapping of hands, stones hit together, or other things that are useful to keep a beat. There are finds of this type that date back to the paleolithic. Some of these are ambiguous, as they can be used either as a tool or a musical instrument. The first flutes The Divje Babe flute The oldest flute ever discovered may be the so-called Divje Babe flute, found in the Slovenian cave Divje Babe I in 1995. It is not certain that the object is really a flute. 
The item in question is a fragment of the femur of a young cave bear, and has been dated to about 43,000 years ago. However, whether it is truly a musical instrument or simply a carnivore-chewed bone is a matter of ongoing debate. In 2008, archaeologists discovered a bone flute in the Hohle Fels cave near Ulm, Germany. The five-holed flute has a V-shaped mouthpiece and is made from a vulture wing bone. The researchers involved in the discovery officially published their findings in the journal Nature, in June 2009. The discovery is also the oldest confirmed find of any musical instrument in history. Other flutes were also found in the cave. This flute was found next to the Venus of Hohle Fels and a short distance from the oldest known human carving. When they announced their discovery, the scientists suggested that the  finds demonstrate the presence of a well-established musical tradition at the time when modern humans colonized Europe . The oldest known wooden pipes were discovered near Greystones, Ireland, in 2004. A wood-lined pit contained a group of six flutes made from yew wood, between 30 and 50 cm long, tapered at one end, but without any finger holes. They may once have been strapped together. In 1986 several bone flutes were found in Jiahu in Henan Province, China. They date to about 6,000 BC. They have between 5 and 8 holes each and were made from the hollow bones of a bird, the Red-crowned Crane. At the time of the discovery, one was found to be still playable. The bone flute plays both the five- or seven-note scale of Xia Zhi and six-note scale of Qing Shang of the ancient Chinese musical system. Ancient times It is not known what the earliest music of the cave people was like. Some architecture, even some paintings, are thousands of years old, but old music could not survive until people learned to write it down. 
The only way we can guess about early music is by looking at very old paintings that show people playing musical instruments, or by finding them in archaeological digs (digging underground to find old things). The earliest piece of music that was ever written down and that has not been lost was discovered on a tablet written in Hurrian, a language spoken in and around northern Mesopotamia (where Iraq is today), from about 1500 BC. The Oxfords Companion to Music, ed. Percy Scholes, London 1970 Middle Ages Another early piece of written music that has survived was a round called Sumer Is Icumen In. It was written down by a monk around the year 1250. Much of the music in the Middle Ages (roughly 450-1420) was folk music played by working people who wanted to sing or dance. When people played instruments, they were usually playing for dancers. However, most of the music that was written down was for the Catholic church. This music was written for monks to sing in church. It is called Chant (or Gregorian chant). Renaissance In the Renaissance (roughly 1400–1550) there was a lot of music, and many composers wrote music that has survived so that it can be performed, played or sung today. The name for this period (Renaissance) is a French word which means  rebirth . This period was called the  rebirth  because many new types of art and music were reborn during this time. Some very beautiful music was written for use in church services (sacred music) by the Italian composer Giovanni da Palestrina (1525–1594). In Palestrina's music, many singers sing together (this is called a choir). There was also plenty of music not written for the church, such as happy dance music and romantic love songs. Popular instruments during the Renaissance included the viols (a string instrument played with a bow), lutes (a plucked stringed instrument that is a little like a guitar), and the virginal, a small, quiet keyboard instrument. 
Baroque In the arts, the Baroque was a Western cultural era, which began near the turn of the 17th century in Rome. It was exemplified by drama and grandeur in sculpture, painting, literature, dance, and music. In music, the term 'Baroque' applies to the final period of dominance of imitative counterpoint, where different voices and instruments echo each other but at different pitches, sometimes inverting the echo, and even reversing thematic material. The popularity and success of the Baroque style was encouraged by the Roman Catholic Church which had decided at the time of the Council of Trent that the arts should communicate religious themes in direct and emotional involvement. The upper class also saw the dramatic style of Baroque architecture and art as a means of impressing visitors and expressing triumphant power and control. Baroque palaces are built around an entrance of courts, grand staircases and reception rooms of sequentially increasing opulence. In similar profusions of detail, art, music, architecture, and literature inspired each other in the Baroque cultural movement as artists explored what they could create from repeated and varied patterns. Some traits and aspects of Baroque paintings that differentiate this style from others are the abundant amount of details, often bright polychromy, less realistic faces of subjects, and an overall sense of awe, which was one of the goals in Baroque art. The word baroque probably derives from the ancient Portuguese noun  barroco  which is a pearl that is not round but of unpredictable and elaborate shape. Hence, in informal usage, the word baroque can simply mean that something is  elaborate , with many details, without reference to the Baroque styles of the seventeenth and eighteenth centuries. Classical period In western music, the classical period means music from about 1750 to 1825. It was the time of composers like Joseph Haydn, Wolfgang Amadeus Mozart and Ludwig van Beethoven. 
Orchestras became bigger, and composers often wrote longer pieces of music called symphonies that had several sections (called movements). Some movements of a symphony were loud and fast; other movements were quiet and sad. The form of a piece of music was very important at this time. Music had to have a nice 'shape'. They often used a structure which was called sonata form. Another important type of music was the string quartet, which is a piece of music written for two violins, a viola, and a violoncello. Like symphonies, string quartet music had several sections. Haydn, Mozart and Beethoven each wrote many famous string quartets. The piano was invented during this time. Composers liked the piano, because it could be used to play dynamics (getting louder or getting softer). Other popular instruments included the violin, the violoncello, the flute, the clarinet, and the oboe. Romantic period The 19th century is called the Romantic period. Composers were particularly interested in conveying their emotions through music. An important instrument from the Romantic period was the piano. Some composers, such as Frederic Chopin wrote subdued, expressive, quietly emotional piano pieces. Often music described a feeling or told a story using sounds. Other composers, such as Franz Schubert wrote songs for a singer and a piano player called Lied (the German word for  song ). These Lieder (plural of Lied) told stories by using the lyrics (words) of the song and by the imaginative piano accompaniments. Other composers, like Richard Strauss, and Franz Liszt created narratives and told stories using only music, which is called a tone poem. Composers, such as Franz Liszt and Johannes Brahms used the piano to play loud, dramatic, strongly emotional music. Many composers began writing music for bigger orchestras, with as many as 100 instruments. 
It was the period of  Nationalism  (the feeling of being proud of one's country) when many composers made music using folksong or melodies from their country. Lots of famous composers lived at this time such as Franz Schubert, Felix Mendelssohn, Frederic Chopin, Johannes Brahms, Pyotr Tchaikovsky and Richard Wagner. Modern times From about 1900 onwards is called the  modern period . Many 20th century composers wanted to compose music that sounded different from the Classical and Romantic music. Modern composers searched for new ideas, such as using new instruments, different forms, different sounds, or different harmonies. The composer Arnold Schoenberg (1874–1951) wrote pieces which were atonal (meaning that they did not sound as if they were in any clear musical key). Later, Schoenberg invented a new system for writing music called twelve-tone system. Music written with the twelve-tone system sounds strange to some, but is mathematical in nature, often making sense only after careful study. Pure twelve-tone music was popular among academics in the fifties and sixties, but some composers such as Benjamin Britten use it today, when it is necessary to get a certain feel. One of the most important 20th-century composers, Igor Stravinsky (1882–1971), wrote music with very complicated (difficult) chords (groups of notes that are played together) and rhythms. Some composers thought music was getting too complicated and so they wrote Minimalist pieces which use very simple ideas. In the 1950s and 1960s, composers such as Karlheinz Stockhausen experimented with electronic music, using electronic circuits, amplifiers and loudspeakers. In the 1970s, composers began using electronic synthesizers and musical instruments from rock and roll music, such as the electric guitar. They used these new instruments to make new sounds. 
Composers writing in the 1990s and the 2000s, such as John Adams (born 1947) and James MacMillan (born 1959) often use a mixture of all these ideas, but they like to write tonal music with easy tunes as well. Electronic music Music can be produced electronically. This is most commonly done by computers, keyboards, electric guitars and disk tables. They can mimic traditional instruments, and also produce very different sounds. 21st-century electronic music is commonly made with computer programs and hardware mixers. Jazz Jazz is a type of music that was invented around 1900 in New Orleans in the south of the USA. There were many black musicians living there who played a style of music called blues music. Blues music was influenced by African music (because the black people in the United States had come to the United States as slaves. They were taken from Africa by force). Blues music was a music that was played by singing, using the harmonica, or the acoustic guitar. Many blues songs had sad lyrics about sad emotions (feelings) or sad experiences, such as losing a job, a family member dying, or having to go to jail (prison). Jazz music mixed together blues music with European music. Some black composers such as Scott Joplin were writing music called ragtime, which had a very different rhythm from standard European music, but used notes that were similar to some European music. Ragtime was a big influence on early jazz, called Dixieland jazz. Jazz musicians used instruments such as the trumpet, saxophone, and clarinet were used for the tunes (melodies), drums for percussion and plucked double bass, piano, banjo and guitar for the background rhythm (rhythmic section). Jazz is usually improvised: the players make up (invent) the music as they play. Even though jazz musicians are making up the music, jazz music still has rules; the musicians play a series of chords (groups of notes) in order. Jazz music has a swinging rhythm. The word  swing  is hard to explain. 
For a rhythm to be a  swinging rhythm  it has to feel natural and relaxed. Swing rhythm is not even like a march. There is a long-short feel instead of a same-same feel. A  swinging rhythm  also gets the people who are listening excited, because they like the sound of it. Some people say that a  swinging rhythm  happens when all the jazz musicians start to feel the same pulse and energy from the song. If a jazz band plays very well together, people will say  that is a swinging jazz band  or  that band really swings well.  Jazz influenced other types of music like the Western art music from the 1920s and 1930s. Art music composers such as George Gershwin wrote music that was influenced by jazz. Jazz music influenced pop music songs. In the 1930s and 1940s, many pop music songs began using chords or melodies from jazz songs. One of the best known jazz musicians was Louis Armstrong (1900–1971). Pop music Main article: Pop music  Pop  music is a type of popular music that many people like to listen to. The term  pop music  can be used for all kinds of music that was written to be popular. The word  pop music  was used from about 1880 onwards, when a type of music called music was popular. Modern pop music grew out of 1950's rock and roll, (for example Chuck Berry, Bo Diddley and Little Richard) and rockabilly (for example Elvis Presley and Buddy Holly). In the 1960s, The Beatles became a famous pop music group. In the 1970s, other styles of music were mixed with pop music, such as funk and soul music. Pop music generally has a heavy (strong) beat, so that it is good for dancing. Pop singers normally sing with microphones that are plugged into an amplifier and a loudspeaker. Musical notation Main article: Musical notation Mozart : First movement of the piano sonata K545 – an example of writing music in staffs  Musical notation  is the way music is written down. Music needs to be written down in order to be saved and remembered for future performances. 
In this way composers (people who write music) can tell others how to play the musical piece as it was meant to be played. Solfège Solfège (sometimes called solfa) is the way tones are named. It was made in order to give a name to the several tones and pitches. For example, the eight basic notes  Do, Re, Mi, Fa, So, La, Ti, Do  are just the names of the eight notes that confirm the major scale. Written music Music can be written in several ways. When it is written on a staff (like in the example shown), the pitches (tones) and their duration are represented by symbols called notes. Notes are put on the lines and in the spaces between the lines. Each position says which tone must be played. The higher the note is on the staff, the higher the pitch of the tone. The lower the notes are, the lower the pitch. The duration of the notes (how long they are played for) is shown by making the note  heads  black or white, and by giving them stems and flags. Music can also be written with letters, naming them as in the solfa  Do, Re, Mi, Fa, So, La, Ti, Do  or representing them by letters. The next table shows how each note of the solfa is represented in the Standard Notation: Solfa Name Standard Notation Do C Re D Mi E Fa F So G La A Ti B The Standard Notation was made to simplify the lecture of music notes, although it is mostly used to represent chords and the names of the music scales. These ways to represent music ease the way a person reads music. There are more ways to write and represent music, but they are less known and may be more complicated. How to enjoy music By listening People can enjoy music by listening to it. They can go to concerts to hear musicians perform. Classical music is usually performed in concert halls, but sometimes huge festivals are organized in which it is performed outside, in a field or stadium, like pop festivals. People can listen to music on CD's, Computers, iPods, television, the radio, casett. record-players and even mobile phones. 
There is so much music today, in elevators, shopping malls, and stores, that it often becomes a background sound that we do not really hear. By playing or singing People can learn to play an instrument. Probably the most common for complete beginners is the piano or keyboard, the guitar, or the recorder (which is certainly the cheapest to buy). After they have learnt to play scales, play simple tunes and read the simplest musical notation, then they can think about which instrument for further development. They should choose an instrument that is practical for their size. For example, a very short child cannot play a full size double bass, because the double bass is over five feet high. People should choose an instrument that they enjoy playing, because playing regularly is the only way to get better. Finally, it helps to have a good teacher. By composing Anyone can make up his or her own pieces of music. It is not difficult to compose simple songs or melodies (tunes). It's easier for people who can play an instrument themselves. All it takes is experimenting with the sounds that an instrument makes. Someone can make up a piece that tells a story, or just find a nice tune and think about ways it can be changed each time it is repeated. The instrument might be someone's own voice. The fact is, there are tons of instruments in the world."""

    @staticmethod
    def functionWords():
        """
        List of function words that can be ignored by the VectorSemanticsModel
        """
        return ["are", "that", "to", "was", "and", "a", "the", "in", "of", "is", "it", "which", "or", "from", "many"]

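A stop list like the one above is often applied while tokenizing. The block below is a minimal sketch of that idea, not the approach this notebook takes: `FUNCTION_WORDS` and `content_words` are hypothetical names, and the sketch filters tokens before any counting happens. By contrast, `VectorSemanticsModel.ignore()` removes a word's row from the vocabulary but, unless `hardForget` is set, keeps the word as a co-occurrence dimension.

```python
# Hypothetical stop list mirroring Corpora.functionWords()
FUNCTION_WORDS = {"are", "that", "to", "was", "and", "a", "the", "in",
                  "of", "is", "it", "which", "or", "from", "many"}

def content_words(sentence, stop_words=FUNCTION_WORDS):
    """Lowercase a sentence, split on whitespace, and drop any stop words."""
    return [word for word in sentence.lower().split() if word not in stop_words]

print(content_words("Saturn is the sixth planet from the Sun"))
```

Filtering up front shrinks every vector, whereas the soft `ignore()` keeps function words available as measurement dimensions for the remaining vocabulary.
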
In [3]:
class WordVector:
    """
    This is a helper class that handles the vector operations for the VectorSemanticsModel class

    Attributes
    components: int list
        the list of numerical components of the vector
    
    Methods
    length(): float
        the length of the vector in n-space
    dot(other): float
        the dot product of the vector with the other vector
    """
    
    def __init__(self, source):
        """
        Constructs all necessary attributes for the WordVector object
        
        Parameters
        source: list
            the source list containing the numerical components of the WordVector
        """
        self.components = source
    
    def length(self):
        """
        Calculates and returns the length of the vector in n-space
        
        Parameters
        None
            
        Returns
        length (float):
            the length of the vector in n-space
        """
        return math.sqrt(sum([component*component for component in self.components]))
    
    def dot(self, other):
        """
        Calculates and returns the dot product of the vector with the other vector
        Prints an error statement if the vectors are of different dimensions.
        
        Parameters
        other: WordVector
            the other vector, whose dot product with this vector is to be found
            
        Returns
        dotProduct (float):
            the dot product of the vector with the other vector
        """
        if len(self.components) != len(other.components):
            print("Error: vectors have different number of components!") #this error SHOULDN'T ever happen...
            return None
        else:
            #pair up the components so that every dimension, including the last one, contributes
            return sum(a * b for a, b in zip(self.components, other.components))

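The two methods of `WordVector` are exactly the pieces needed for cosine similarity: the dot product divided by the product of the two lengths. The block below is a standalone sketch of that formula on plain lists. The name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not behaviour of the classes in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors have different numbers of components")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0, 2], [2, 0, 4]))  # parallel vectors, score close to 1
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors, score of 0
```

Because the score depends only on the angle between the vectors, a frequent word and a rare word can still be rated as similar when they co-occur with the same neighbours in the same proportions.
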
In [4]:
class VectorSemanticsModel:
    """
    This class generates a simple sentence-level co-occurrence-based Vector Semantics model from a string Corpus.

    Attributes
    corpus: str
        the corpus that the co-occurrence model is based on
    sentences : str list
        the corpus, separated into individual sentences
    vocab: str set
        each word from the original corpus that is recognized as a vocabulary word
    dimensions: str set
        each word that can act as a measurement dimension for word co-occurrence
    mappings: dict
        a dictionary mapping each word in vocab to a dictionary mapping each word in dimensions to a co-occurrence frequency
    ignoredVocab: str set
        each word that was in the original corpus, but that the user removed from the vocab set
    ignoredMappings: dict
        a dictionary containing each dictionary that the user removed from the mappings dictionary
          
    Methods
    printTable():
        prints a complete co-occurrence table for all words in the VectorSemanticsModel
    compare(word1, word2, toPrint = False):
        compares two words in the dictionary, giving their similarity score
    mostSimilar(word, toPrint = False):
        compares the word against all other words, giving the most similar word
    ignore(words, hardForget = False):
        allows the user to remove a word or list of words from the VectorSemanticsModel's vocabulary
    relearnWord(word):
        allows the user to return a previously ignored word into the VectorSemanticsModel's vocabulary
    """
    
    def __init__(self, source, ignoreWords = False):
        """
        Constructs all necessary attributes for the VectorSemanticsModel object
        
        Parameters
        source: str
            the source string to be used as a corpus by the VectorSemanticsModel
        ignoreWords: boolean
            if True, ignore function words in Corpora.functionWords()
        """
        self.corpus = source
        self.sentences = re.split(r'[.!?]', re.sub(r'[-;,:\"\'\\)(%$~@^*]', "", self.corpus.lower())) #remove extra symbols
        self.mappings = {}
        self.vocab = set()
        self.dimensions = set()
        self.ignoredVocab = set() #holds words that are being temporarily ignored
        self.ignoredMappings = {}
        for sentence in self.sentences:
            sentenceWords = re.findall(r'\S+', sentence)
            for word1 in sentenceWords:
                if word1 not in self.mappings:
                    self.mappings[word1] = {}
                    self.vocab.add(word1)
                    self.dimensions.add(word1)
                for word2 in sentenceWords:
                    if word1 != word2:
                        #count one co-occurrence for every pairing of distinct words in the sentence
                        self.mappings[word1][word2] = self.mappings[word1].get(word2, 0) + 1
        for word1 in self.vocab: #fill in 0 for pairs that never co-occur
            for word2 in self.dimensions:
                self.mappings[word1].setdefault(word2, 0)
        if ignoreWords:
            self.ignore(Corpora.functionWords())

    def printTable(self):  
        """
        Prints a complete co-occurrence table for all words in the VectorSemanticsModel
        
        Parameters
        N/A
        
        Returns
        None
        """
        print("\t", end = "")
        for word in self.dimensions:
            print(word, end = "\t")
        print("", end = "\n")
        for word1 in self.vocab:
            print(word1, end="\t")
            for word2 in self.dimensions:
                s = ""
                if word1 == word2:
                    s = "X" #a word does not co-occur with itself
                else:
                    s = self.mappings.get(word1).get(word2)
                print(s, end = "\t")
            print("", end = "\n")

    def compare(self, word1, word2, toPrint = False):
        """
        Compares two words in the dictionary, giving their similarity score
        Prints an error message if either word is not recognized
        
        Parameters
        word1: str
            the first word to compare
        word2: str
            the second word
        toPrint: boolean, optional
            if True, this will print out a message about the similarity of the two words
            
        Returns
        score (float): 
            similarity rating of word1 and word2, calculated as the cosine of the angle between word1 and word2
        """
        word1 = word1.lower()
        word2 = word2.lower()
        if word1 in self.vocab and word2 in self.vocab:
            v1 = WordVector([self.mappings[word1][dimension] for dimension in self.dimensions if dimension != word1 and dimension != word2])
            v2 = WordVector([self.mappings[word2][dimension] for dimension in self.dimensions if dimension != word1 and dimension != word2])
            score = v1.dot(v2)/v1.length()/v2.length() #similarity score is the cosine of the angle between the vectors
            if toPrint:
                print("\"", word1, "\"-\"", word2, "\" similarity score: ", score, sep="")
            return score
        else:
            print("Error: I don't recognize both of those words.")
        
    def mostSimilar(self, word, toPrint = False):
        """
        Compares the word against all other words, giving the most similar word
        Prints an error message if the word is not recognized
        
        Parameters
        word: str
            The word to be compared against the vocabulary
        toPrint: boolean, optional
            if True, this will print out a message about the most similar word and its rating
            
        Returns
        mostSimilar (str): 
            the word in the vocabulary that is most similar to the given word
        """
        word = word.lower() #match the lowercasing done in compare()
        if word in self.vocab:
            similarity = 0
            mostSimilar = ""
            for other in self.vocab:
                if other != word:
                    newSimilarity = self.compare(word, other)
                    if newSimilarity > similarity:
                        similarity = newSimilarity
                        mostSimilar = other
            if toPrint:
                print("The most similar word to \"", word, "\" is: \"", mostSimilar, "\" (similarity:", similarity, ")", sep = "" )
            return mostSimilar
        else:
            print("Error: I don't recognize this word!")
            
    def ignore(self, words = Corpora.functionWords(), hardForget = False):
        """
        Allows the user to remove a word or list of words from the VectorSemanticsModel's vocabulary
        The "ignored" words are still contained in the model and can be "relearned" later, unless hardForget is True
        
        Parameters
        words: str or str list, optional
            a word or list of words to be ignored by the vocabulary; defaults to Corpora.functionWords()
        hardForget: boolean, optional
            if True, the word will be entirely removed from the mappings dictionary and as a dimension, and the word will no longer be relearnable
            
        Returns
        None
        """
        if isinstance(words, str):#in case of deleting a single word
            words = [words]
        for word in words:
            word = word.lower()
            if word in self.vocab:
                self.vocab.remove(word)    
                ignoredMapping = self.mappings.pop(word)
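                if hardForget:
                    #drop the word as a dimension everywhere, including in mappings that are currently ignored
                    self.dimensions.discard(word)
                    for mapping in self.mappings.values():
                        mapping.pop(word, None)
                    for mapping in self.ignoredMappings.values():
                        mapping.pop(word, None)
                else:
                    self.ignoredMappings[word] = ignoredMapping
                    self.ignoredVocab.add(word)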
    
    def relearnWord(self, word):
        """
        Allows the user to return a removed word into the VectorSemanticsModel's vocabulary
        
        Parameters
        word: str
            the word to be relearned
            
        Returns
        None
        """
        word = word.lower() #ignored words are stored in lowercase
        if word not in self.ignoredVocab:
            print("Sorry, that word is completely unfamiliar!")
        else:
            self.mappings[word] = self.ignoredMappings.pop(word)
            self.vocab.add(word)
            self.ignoredVocab.remove(word)

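The counting done in the constructor can be seen in isolation on a tiny corpus. The block below is a simplified sketch under stated assumptions: `cooccurrence_counts` is a hypothetical helper, it reuses the same two regular expressions for cleaning and sentence splitting, and it skips the zero-filling and the vocabulary bookkeeping that the class performs.

```python
import re
from collections import defaultdict

def cooccurrence_counts(corpus):
    """Count, for each word, how often every other word shares a sentence with it."""
    # strip punctuation that should not end a sentence, then lowercase
    cleaned = re.sub(r'[-;,:"\'\\)(%$~@^*]', "", corpus.lower())
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in re.split(r'[.!?]', cleaned):
        tokens = sentence.split()
        for word1 in tokens:
            for word2 in tokens:
                if word1 != word2:
                    counts[word1][word2] += 1
    return counts

counts = cooccurrence_counts("See Spot. See Spot run. Run, Spot, run!")
print(dict(counts["spot"]))
```

Note that a word repeated within one sentence is counted once per occurrence, so the two instances of "run" in the last sentence each add to the "spot" row.
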
# DEMONSTRATIONS:

In [5]:
#loading both corpora into memory
saturn = VectorSemanticsModel(Corpora.saturn())
music = VectorSemanticsModel(Corpora.music(), ignoreWords = True)

In [6]:
#let's see what is most similar to water according to the Saturn corpus:
saturn.mostSimilar("water", toPrint = True)
#this isn't very interesting. Let's remove some common words, so we get content words instead of function words:

The most similar word to "water" is: "is" (similarity:0.8431560808150893)


'is'

In [7]:
saturn.ignore()
saturn.mostSimilar("water", toPrint = True)
#interesting! For reference, the word "ultraviolet" occurs TWICE in the original article, both in connection to a process 
# by which ultraviolet light from the sun breaks down water in the rings to form an oxygen atmosphere. This is the limitation
# of using such a short corpus.

The most similar word to "water" is: "ultraviolet" (similarity:0.8232672460135735)


'ultraviolet'

In [8]:
#Let's try another word:
saturn.mostSimilar("light", toPrint = True)
#I knew this was coming. "light" appears 11 times in the original article, and TWO of those come immediately after the word
# "ultraviolet"

The most similar word to "light" is: "ultraviolet" (similarity:0.9280095240665728)


'ultraviolet'

In [9]:
#Let's see how similar these two words are!
saturn.compare("water", "light", toPrint = True)
#They are still fairly similar, but less so than the pairs above. We knew this was coming: neither word can be closer to the
# other than it is to its own most similar word. This is also semantically appropriate, because the ideas of water and light
# are not very similar. If anything, the model rates them as MORE SIMILAR than they really are, because its only background
# knowledge is an article about the formation of an oxygen atmosphere in Saturn's rings, whereas a person thinks of light as
# the stuff that lets you see and of water as the stuff you drink

"water"-"light" similarity score: 0.7657885134337987


0.7657885134337987

In [10]:
#Let's try the music corpus:
music.mostSimilar("music", toPrint = True)
#this corpus has already been treated to ignore basic function words

The most similar word to "music" is: "written" (similarity:0.834360159761038)


'written'

In [11]:
#Let's try the ignore functionality to get another similar word. There are better ways to do this, but 
# in my simple implementation, this is the best way to get the second most-similar word

In [12]:
music.ignore("written")
music.mostSimilar("music", toPrint = True)

The most similar word to "music" is: "played" (similarity:0.8237626120640137)


'played'

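One alternative to ignoring the top match and asking again is to score every candidate once and sort. The block below is a minimal sketch of that ranking on made-up vectors. `top_similar` and `toy_vectors` are hypothetical, and the cosine here uses every dimension, unlike `compare()`, which leaves out the dimensions of the two words being compared.

```python
import math

def top_similar(word, vectors, n=3):
    """Return the n (word, score) pairs with the highest cosine similarity to `word`."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

# toy co-occurrence vectors, made up for illustration
toy_vectors = {
    "music":   [3, 1, 0],
    "written": [3, 1, 1],
    "played":  [2, 1, 2],
    "jail":    [0, 0, 4],
}
for other, score in top_similar("music", toy_vectors, n=2):
    print(other, round(score, 3))
```

Sorting leaves the model untouched, so no word has to be relearned afterwards.
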
In [13]:
#You can also make a custom corpus:
spotCorpus = "See Dick. See Jane. See Dick and Jane. Dick and Jane run. See Spot. See Dick and Jane run with Spot. See them run."
spotVSM = VectorSemanticsModel(spotCorpus)
spotVSM.compare("Dick", "Jane", toPrint = True)

"dick"-"jane" similarity score: 1.0000000000000002


1.0000000000000002
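
The score above is slightly greater than 1, which a true cosine can never be; the excess comes from floating-point rounding in the square roots and divisions. The block below sketches two common ways to handle this, clipping the score or comparing with a tolerance. `clamp_cosine` is a hypothetical helper and is not part of the model.

```python
import math

def clamp_cosine(score):
    """Clip a cosine score into [-1.0, 1.0] to absorb floating-point rounding."""
    return max(-1.0, min(1.0, score))

print(clamp_cosine(1.0000000000000002))        # clipped back into the valid range
print(math.isclose(1.0000000000000002, 1.0))   # tolerance-based comparison
```

Clipping matters mostly when the score is passed on to something like `math.acos`, which raises a `ValueError` for inputs outside [-1, 1].
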