# Report - Search Engine

## I) Data

#### Characteristics
* A collection of articles from American news sources: CNN, Reuters, New York Times, etc.
* [Data set from Kaggle](https://www.kaggle.com/snapcrack/all-the-news) with over 150K articles
* Because of the data processing time, I used 1000 articles

#### Storage
* SQLite3 database and CSV files
* One row corresponds to one article
* Table header: __`id, title, publication, author, date, year, month, url, content`__

#### Access
* I retrieved the data with the SQL query: __`SELECT id, content FROM articles`__
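
The access step can be sketched with the standard `sqlite3` module. This is a minimal sketch, not the code used in the report: the helper name `load_articles`, the `ORDER BY id` clause and the `LIMIT` parameter are assumptions added for illustration, and the demo runs on an in-memory database with the same column layout.

```python
import sqlite3

def load_articles(conn, limit=1000):
    """Return a list of (id, content) tuples for at most `limit` articles."""
    # ORDER BY and LIMIT are illustrative additions to the report's query.
    cur = conn.execute(
        "SELECT id, content FROM articles ORDER BY id LIMIT ?", (limit,)
    )
    return cur.fetchall()

# Demo on an in-memory database that mimics the table header from the report.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, publication TEXT, "
    "author TEXT, date TEXT, year INTEGER, month INTEGER, url TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO articles (id, title, content) VALUES (?, ?, ?)",
    [(1, "First", "first article text"), (2, "Second", "second article text")],
)
rows = load_articles(conn, limit=1)
```

Each returned tuple has the same `(id, content)` shape that the later snippets index as `d[1]` when they read `self.data`.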

#### Sample article
"WASHINGTON - Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administrations authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been demanding an end to the law for years. In another twist, Donald J. Trump's administration, worried about preserving executive branch prerogatives, could choose to fight its Republican allies in the House on some central questions in the dispute. Eager to avoid an ugly political pileup, Republicans on Capitol Hill and the Trump transition team are gaming out how to handle the lawsuit, which, after the election, has been put in limbo until at least late February by the United States Court of Appeals for the District of Columbia Circuit. They are not yet ready to divulge their strategy. Given that this pending litigation involves the Obama administration and Congress, it would be inappropriate to comment, said Phillip J. Blando, a spokesman for the Trump transition effort. Upon taking office, the Trump administration will evaluate this case and all related aspects of the Affordable Care Act. In a potentially   decision in 2015, Judge Rosemary M. 
Collyer ruled that House Republicans had the standing to sue the executive branch over a spending dispute and that the Obama administration had been distributing the health insurance subsidies, in violation of the Constitution, without approval from Congress. The Justice Department, confident that Judge Collyer's decision would be reversed, quickly appealed, and the subsidies have remained in place during the appeal. In successfully seeking a temporary halt in the proceedings after Mr. Trump won, House Republicans last month told the court that they and the transition team currently are discussing potential options for resolution of this matter, to take effect after the inauguration on Jan. 20, 2017. The suspension of the case, House lawyers said, will provide the   and his future administration time to consider whether to continue prosecuting or to otherwise resolve this appeal. Republican leadership officials in the House acknowledge the possibility of cascading effects if the   payments, which have totaled an estimated $13 billion, are suddenly stopped. Insurers that receive the subsidies in exchange for paying    costs such as deductibles and   for eligible consumers could race to drop coverage since they would be losing money. Over all, the loss of the subsidies could destabilize the entire program and cause a lack of confidence that leads other insurers to seek a quick exit as well. Anticipating that the Trump administration might not be inclined to mount a vigorous fight against the House Republicans given the dim view of the health care law, a team of lawyers this month sought to intervene in the case on behalf of two participants in the health care program. In their request, the lawyers predicted that a deal between House Republicans and the new administration to dismiss or settle the case will produce devastating consequences for the individuals who receive these reductions, as well as for the nations health insurance and health care systems generally. 
No matter what happens, House Republicans say, they want to prevail on two overarching concepts: the congressional power of the purse, and the right of Congress to sue the executive branch if it violates the Constitution regarding that spending power. House Republicans contend that Congress never appropriated the money for the subsidies, as required by the Constitution. In the suit, which was initially championed by John A. Boehner, the House speaker at the time, and later in House committee reports, Republicans asserted that the administration, desperate for the funding, had required the Treasury Department to provide it despite widespread internal skepticism that the spending was proper. The White House said that the spending was a permanent part of the law passed in 2010, and that no annual appropriation was required even though the administration initially sought one. Just as important to House Republicans, Judge Collyer found that Congress had the standing to sue the White House on this issue a ruling that many legal experts said was flawed and they want that precedent to be set to restore congressional leverage over the executive branch. But on spending power and standing, the Trump administration may come under pressure from advocates of presidential authority to fight the House no matter their shared views on health care, since those precedents could have broad repercussions. It is a complicated set of dynamics illustrating how a quick legal victory for the House in the Trump era might come with costs that Republicans never anticipated when they took on the Obama White House.".

#### Processing
* During the successive processing stages I saved the results to __`csv`__ files and to on-disk numpy arrays (__`npy`__)
* The numpy arrays were more efficient than the csv files
* The __1000__ processed documents contained __24474__ unique words
* The __5000__ processed documents contained __52757__ unique words

#### 1) Removing non-alphanumeric symbols and numbers, and converting letters to lowercase
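
Saving intermediate results in both formats could look roughly like the sketch below. The file names and the small stand-in matrix are assumptions made for illustration; the report does not show its own persistence code.

```python
import os
import tempfile

import numpy as np

# Small stand-in for a processed term-by-document matrix.
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "matrix.npy")
    csv_path = os.path.join(tmp, "matrix.csv")

    np.save(npy_path, matrix)                    # binary, keeps dtype and shape
    np.savetxt(csv_path, matrix, delimiter=",")  # text, must be parsed on load

    from_npy = np.load(npy_path)
    from_csv = np.loadtxt(csv_path, delimiter=",")
```

The `npy` format stores the raw array bytes plus a short header, so loading skips the text parsing that a csv file requires.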

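In [5]:
import re

def get_words(raw_text):
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_text)
    # Replace each pair of spaces with one space; longer runs are only halved,
    # which is harmless here because later steps tokenize with str.split().
    letters_only = re.sub(r" {2}", " ", letters_only)
    return letters_only.lower()

get_words("Sample sentence, preproccesing! @ '2017'")

'sample sentence preproccesing     '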
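
The output above still ends with a run of spaces. If a fully collapsed string is preferred, a variant along the following lines would work. The function name `get_words_compact` is hypothetical and this is not the version used to produce the results in this report.

```python
import re

def get_words_compact(raw_text):
    # Replace each run of non-letter characters with a single space,
    # then trim the ends and lowercase the result.
    letters_only = re.sub(r"[^a-zA-Z]+", " ", raw_text)
    return letters_only.strip().lower()

cleaned = get_words_compact("Sample sentence, preproccesing! @ '2017'")
```

Because the later steps split on whitespace, both variants should yield the same tokens; the difference matters only if the cleaned string itself is stored or compared.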

#### 2) Splitting into words: `stemming, lemmatization, stop_words`

In [12]:
import spacy
import nltk
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load('my_model')

def stem(words):
    doc = words.split()
    stemmer = nltk.stem.PorterStemmer()
    return [stemmer.stem(token) for token in doc]

def lematize(words):
    doc = nlp(words)
    return [token.lemma_ for token in doc]

In [14]:
stem('I was suprised that meeting went exceptionally well')

['I', 'wa', 'supris', 'that', 'meet', 'went', 'except', 'well']

In [15]:
lematize('I was suprised that meeting went exceptionally well')

['-PRON-', 'be', 'supris', 'that', 'meeting', 'go', 'exceptionally', 'well']

In [19]:
print(['I', 'wa', 'supris', 'that', 'meet', 'went', 'except', 'well'])
print ([w for w in ['I', 'wa', 'supris', 'that', 'meet', 'went', 'except', 'well'] if not w in STOP_WORDS])
STOP_WORDS

['I', 'wa', 'supris', 'that', 'meet', 'went', 'except', 'well']
['I', 'wa', 'supris', 'meet', 'went']


{'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against',
 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always',
 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow',
 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at',
 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been',
 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides',
 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'ca', 'call', 'can',
 'cannot', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due',
 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty',
 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere',
 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former',
 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get',
 'give', ...
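
In the filtered output above, `'I'` and `'wa'` survive because the stop list holds lowercase, unstemmed words. One possible remedy, sketched here under assumptions, is to lowercase the text and remove stop words before stemming. The helper `remove_stop_words` and the tiny `SAMPLE_STOP_WORDS` set are hypothetical; in practice the spaCy `STOP_WORDS` set would be passed in.

```python
def remove_stop_words(text, stop_words):
    """Lowercase and split `text`, dropping tokens found in `stop_words`."""
    return [w for w in text.lower().split() if w not in stop_words]

# Hypothetical subset of a stop list, used only to keep the demo self-contained.
SAMPLE_STOP_WORDS = {"i", "was", "that", "well"}

kept = remove_stop_words(
    "I was suprised that meeting went exceptionally well", SAMPLE_STOP_WORDS
)
```

Stemming (for example with the `stem` function above) would then be applied to the remaining tokens only.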

#### 3) Creating: `termSet, bag-of-words, vector`

In [20]:
def makeTermSet(self):
    ts = set()
    docs = [d[1] for d in self.data]
    for doc in docs:
        words = doc.split()
        for word in words:
            if (len(word) > 1):
                ts.add(word)
    return sorted(ts)

In [None]:
def makeTermCount(self):
    termCount = {k: 0 for k in self.termSet}
    docs = [d[1] for d in self.data]
    for doc in docs:
        words = doc.split()
        for word in words:
            if (len(word) > 1):
                termCount[word] = termCount.get(word, 0) + 1
    return termCount

In [None]:
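def makeVector(self, doc):
    # One counter per term; dicts keep insertion order, so the values
    # follow the order of self.termSet.
    termCount = {k: 0 for k in self.termSet}
    words = doc.split()
    for word in words:
        # Count only known terms, so the vector always has len(self.termSet) entries.
        if word in termCount:
            termCount[word] += 1
    res = list(termCount.values())
    return res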
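
The three methods above can be combined into one pass that builds the whole term-by-document matrix. The following is a self-contained sketch under assumptions: `build_term_document_matrix` is a hypothetical helper, not a method of the class used in this report, and it takes already cleaned document strings as input.

```python
from collections import Counter

import numpy as np

def build_term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i, j] counts term i in document j."""
    # Same rule as in the report: ignore one-letter tokens.
    tokenized = [[w for w in doc.split() if len(w) > 1] for doc in docs]
    terms = sorted(set(w for words in tokenized for w in words))
    index = {term: i for i, term in enumerate(terms)}

    matrix = np.zeros((len(terms), len(docs)))
    for j, words in enumerate(tokenized):
        for word, count in Counter(words).items():
            matrix[index[word], j] = count
    return terms, matrix

terms, A = build_term_document_matrix(["trump white house", "white house a garden"])
```

Rows correspond to terms and columns to documents, which matches the `(24474, 1000)` shape quoted later in the report.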

#### 4) Inverse Document Frequency
* Decreases the weight of common words and increases the weight of rare ones
* This was the most time-consuming operation and the main factor limiting the size of the input data set
* For __1000__ articles it took about 30 min
* For __5000__ articles the estimated time is about 5 hours
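
Written as a formula, the weighting computed by `termIdf` and applied by `applyIdf` is shown below. The symbols are notation introduced here for illustration: $N$ is the number of documents, $n_t$ the number of documents containing term $t$, and $a_{t,d}$ the count of term $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad \tilde{a}_{t,d} = a_{t,d}\,\mathrm{idf}(t)
$$

For example, with $N = 1000$ a term found in 10 documents gets the weight $\ln 100 \approx 4.6$, while a term found in every document gets $\ln 1 = 0$ and drops out of the comparison entirely.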

In [None]:
def termIdf(self, term):
    n = len(self.ids)
    count = 0  
    docs = [d[1] for d in self.data]
    for doc in docs:
        words = doc.split()
        if term in words:
            count += 1
    return np.log(n / count)

In [None]:
def applyIdf(self):
    res = []
    for i, term in enumerate(self.termSet):
        tidf = self.termIdf(term)
        # Row i of the term-by-document matrix holds the counts of `term`.
        col = self.idVectorMatrix[i] * tidf
        res.append(col)
    return np.array(res)
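
Since `termIdf` re-splits every document for every term, most of the running time is probably spent there. If the term-by-document count matrix is already in memory, the document frequencies can be read from it directly. The sketch below assumes a dense `(terms x documents)` numpy array; `apply_idf` is a hypothetical standalone function, not the method used for the timings above.

```python
import numpy as np

def apply_idf(counts):
    """Scale each term row of a (terms x documents) count matrix by its IDF."""
    n_docs = counts.shape[1]
    # Number of documents in which each term occurs at least once.
    doc_freq = np.count_nonzero(counts, axis=1)
    # np.maximum guards against division by zero for all-zero rows.
    idf = np.log(n_docs / np.maximum(doc_freq, 1))
    return counts * idf[:, np.newaxis]

counts = np.array([[1.0, 2.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 3.0]])
weighted = apply_idf(counts)
```

The second row belongs to a term present in every document, so its weight is `log(4 / 4) = 0` and the whole row becomes zero.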

#### 5) Similarity measure and vector normalization

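In [22]:
import math
import numpy as np

def correlation(q, d):
    # Cosine similarity: dot product divided by the product of the Euclidean norms.
    # (len(q) * len(d) is only the product of the element counts, not of the norms.)
    l = np.transpose(q) @ d
    m = np.linalg.norm(q) * np.linalg.norm(d)
    return l / m

In [29]:
def normalizeV(vector):
    # Scale the vector to unit Euclidean length.
    n = math.sqrt(sum([e**2 for e in vector]))
    return [e / n for e in vector]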
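
Both functions above divide by a vector norm, which fails for an all-zero vector, for example a query that contains no known term. A guarded variant might look like the sketch below; the name `cosine_similarity` and the choice to return `0.0` in the degenerate case are assumptions.

```python
import numpy as np

def cosine_similarity(q, d):
    """Cosine of the angle between q and d; 0.0 if either vector is all zeros."""
    q = np.asarray(q, dtype=float)
    d = np.asarray(d, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    if denom == 0.0:
        return 0.0
    return float(q @ d / denom)

same = cosine_similarity([1, 2, 0], [2, 4, 0])        # parallel vectors
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])  # no shared terms
empty = cosine_similarity([0, 0, 0], [1, 1, 1])       # degenerate query
```

Parallel vectors score close to 1, vectors without a shared term score 0, and the degenerate case no longer raises a warning or produces `nan`.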

#### 6) Low Rank Matrix Approximation A (24474, 1000)

In [34]:
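def compress(data, k):
    # full_matrices=False gives the thin SVD, so for a (24474, 1000) matrix
    # U is (24474, 1000) instead of (24474, 24474); the truncated result is the same.
    # Note that np.linalg.svd returns V already transposed.
    U, s, V = np.linalg.svd(data, full_matrices=False)
    Ur, sr, Vr = reduce(k, U, s, V)
    result = compose(Ur, sr, Vr)
    return result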

def reduce(r, U, s, V):
    s = s[:r]
    U = U[:, :r]
    V = V[:r, :]
    return U, s, V

def compose(U, s, V):
    D = np.diag(s)
    return U @ D @ V
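
A quick sanity check for the approximation is to confirm that the result has rank `k` and that its Frobenius error equals the root of the sum of squares of the discarded singular values. The sketch below runs this on a small random matrix; `low_rank` is a hypothetical helper that mirrors `compress`, and the matrix size and seed are arbitrary.

```python
import numpy as np

def low_rank(A, k):
    """Return the rank-k truncated SVD approximation of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :], s

rng = np.random.default_rng(0)
A = rng.random((8, 5))

A2, s = low_rank(A, 2)
error = np.linalg.norm(A - A2)           # Frobenius norm for 2-D input
expected = np.sqrt(np.sum(s[2:] ** 2))   # energy of the discarded singular values
```

If `error` and `expected` disagree by more than rounding error, the truncation or the recomposition is likely indexing the wrong axis.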

#### 6.1) LRMA test
* I ran the same query __q__ against the matrix processed with SVD for a given __k__, then compared the resulting __id__ values

##### 6.1.1)
* q : __`'donald trump russia white house'`__
* k : __`[1000, 50, 100, 200, 500, 750]`__
* id : __`[18390, 17911, 17911, 18139, 18390, 18390]`__
* All 3 articles concern Donald Trump's administrative decisions, but none of them concerns Russia

##### 6.1.2)
* q : __`'olympic games athlets sport'`__
* k : __`[1000, 50, 100, 200, 500, 750]`__
* id : __`[17777, 17938, 17938, 18353, 17768, 17768]`__
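
The comparison across ranks can be sketched as a loop that factors the matrix once and truncates it for each `k`. This is an illustration under assumptions, not the script that produced the ids listed above: `best_match_per_rank` is a hypothetical helper and the 3 x 3 matrix is toy data.

```python
import numpy as np

def best_match_per_rank(A, q, ids, ranks):
    """Map each rank k to the id of the document most similar to q."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # factor once, truncate per k
    q = q / np.linalg.norm(q)
    best = {}
    for k in ranks:
        Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
        norms = np.linalg.norm(Ak, axis=0)
        norms[norms == 0] = 1.0           # avoid dividing by zero for empty columns
        scores = q @ (Ak / norms)         # cosine similarity with every column
        best[k] = ids[int(np.argmax(scores))]
    return best

A = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 0.0, 0.0]])
q = np.array([1.0, 0.0, 0.0])
best = best_match_per_rank(A, q, ids=[10, 20, 30], ranks=[3, 2, 1])
```

With `k` equal to the full rank the approximation reproduces the original matrix, so that entry serves as the baseline against which the smaller ranks are compared.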

#### 7) Query
* The query method returns the id of an article; the content then has to be fetched from the database
* __`SELECT content FROM articles WHERE id=?`__
* Processing a single query takes about 15 s

In [None]:
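def query(self, terms):
    cor = []
    words = self.stem(terms)
    words = [w for w in words if w not in STOP_WORDS]
    # makeVector calls doc.split(), so join the tokens back into one string.
    q = self.makeVector(" ".join(words))
    q = normalizeV(q)
    # NOTE: the loop bound comes from lrmaVecMat, but the columns are read from
    # idfVectorMatrix; read self.lrmaVecMat[:, j] to score against the low-rank matrix.
    for j in range(self.lrmaVecMat.shape[1]):
        d = self.idfVectorMatrix[:, j]
        d = normalizeV(d)
        corel = self.correlation(np.array(q), np.array(d))
        cor.append(corel)
    # Return the id of the best-scoring document.
    return self.ids[np.argmax(cor)]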
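
The loop above normalizes every document column in pure Python on each query, which may account for much of the query time. A vectorized variant that also returns several candidates instead of one is sketched below. `top_matches` is a hypothetical standalone function, and it assumes the document matrix is a dense numpy array with one column per document.

```python
import numpy as np

def top_matches(doc_matrix, q, ids, n=3):
    """Return the ids of the n documents most similar to q, best first."""
    norms = np.linalg.norm(doc_matrix, axis=0)
    norms[norms == 0] = 1.0
    unit_docs = doc_matrix / norms        # could be computed once and cached
    q_norm = np.linalg.norm(q)
    if q_norm == 0:
        return []                         # the query shares no term with the vocabulary
    scores = (q / q_norm) @ unit_docs     # cosine similarity with every document
    order = np.argsort(scores)[::-1][:n]
    return [ids[i] for i in order]

docs = np.array([[3.0, 0.0, 1.0],
                 [0.0, 2.0, 1.0],
                 [1.0, 0.0, 0.0]])
result = top_matches(docs, np.array([1.0, 0.0, 0.0]), ids=[10, 20, 30], n=2)
```

Looking at the top few ids rather than only the best one also makes it easier to judge how stable the ranking is between runs.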

##### Example:
* query : __'donald trump russia white house'__
* result title : __"The Phrase Putin Never Uses About Terrorism (and Trump Does) - The New York Times"__
* result content 

"MOSCOW Vladimir V. Putin, Russia's president, hardly misses a chance to talk tough on terrorism, once famously saying he would find Chechen terrorists sitting in the outhouse and rub them out. He and President Trump, notably dismissive of political correctness, would seem to have found common language on fighting terrorism except on one point of, well, language. During his campaign, Mr. Trump associated Islam with terrorism and criticized President Obama for declining to use the phrase radical Islamic terrorism. However, Mr. Putin, whom Mr. Trump so openly admires for his toughness, has, for more than a decade, done exactly what President Obama did. He has never described terrorists as Islamic and has repeatedly gone out of his way to denounce such language. I would prefer Islam not be mentioned in vain alongside terrorism, he said at a news conference in December, answering a question about the Islamic State, a group he often refers to as the   Islamic State, to emphasize a distinction with the Islamic religion. At the opening of a mosque in Moscow in 2015, Mr. Putin spoke of terrorists who cynically exploit religious feelings for political aims. In the Middle East, Mr. Putin said at the mosque opening, terrorists from the  Islamic State are compromising a great world religion, compromising Islam, sowing hatred, killing people, including clergy, and added that their ideology is built on lies and blatant distortions of Islam. He was careful to add, Muslim leaders are bravely and fearlessly using their own influence to resist this extremist propaganda. And, this being Russia, the failure to adhere to this   interpretation is a prosecutable offense: The Russian news media are required by law to note in any mention of the Islamic State that the reference is to a banned terrorist organization of that name, lest it be misconstrued as denigrating religion. Mr. 
Putin does not take this stance to soothe the feelings of Western liberals, a group he dismisses as hypocritical in any case. Putin prides himself on Russia's intelligence capabilities, the Brookings Institution wrote in a study of the early formation of his counterterrorism policies. Russian leaders think they know their enemy, and it is not the governments of majority Muslim countries such as Iraq and Iran, or the majority of Muslims living in Russia. Instead, Russian counterterrorism strategy focused on financing and militarily backing moderate Muslim leaders, with the breakthrough in the Chechen war coming when the regions imam, Akhmad Kadyrov, allied with the Russian military. His son, Ramzan Kadyrov, leads the region today. While embracing Islamic leaders as a centerpiece of its counterterrorism strategy, however, the Kremlin did not avoid drawing distinctions along religious lines. The Russian government backed the Kadyrov family's campaign to revive traditional Sufi Islam in Chechnya as a counterweight to the more austere Wahhabi denomination professed by many separatists. The Wahhabi strain was outlawed in another restive, predominantly Muslim province, Dagestan, and its adherents are persecuted in Russia, rights groups say. Still, the alliance with moderate Islamic religious leaders became important in pacifying Chechnya and other North Caucasus regions, which have ceased to pose a serious security threat to Russia. Putin rules a multiconfessional country, Orkhan Dzhemal, a commentator on Islamic affairs, said in a telephone interview, noting that in the United States, in contrast, Muslims are not a powerful political force. He cannot say Islamic terrorism for a simple reason. He doesn't want to alienate millions of Russians. The term preferred in Russian political parlance is international terrorism. In a phone call on Friday, President Trump and Mr. Putin discussed real cooperation in fighting terrorist groups in Syria. They could agree on an enemy. 
But the Kremlin statement described a priority placed on uniting forces in the fight against the main threat international terrorism."

#### 8) Conclusions
* After applying SVD and LRMA, accuracy improved for some queries
* The overall consistency of the query results is mediocre; unfortunately, I was unable to find the source of the problem