Tokenization is the process of splitting a corpus into meaningful units in natural language processing. The main concepts and considerations are as follows.

1. Word tokenization
  - Splits text into word-level tokens.
  - Punctuation is commonly removed, but doing so can discard meaning.
  - In English, the handling of apostrophes and similar characters varies between tools; NLTK, Keras, and others each take a different approach.

2. Choices made during tokenization
  - For example, "Don't" can be split as Do n't, Dont, or Don ' t.
  - NLTK's word_tokenize and WordPunctTokenizer and Keras's text_to_word_sequence each tokenize it differently.

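3. Considerations when tokenizing
  - Blindly removing punctuation and special characters can destroy meaning.
  - Some words contain internal spaces and should be kept as one token (e.g., "New York").
  - Standard schemes such as Penn Treebank tokenization keep hyphenated words intact and split off clitics joined by an apostrophe (e.g., "doesn't" becomes "does" and "n't").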

4. Sentence tokenization
  - Splits text into sentences.
  - NLTK's sent_tokenize correctly handles abbreviations that contain periods (e.g., Ph.D.).
  - For Korean, tools such as KSS (Korean Sentence Splitter) are used.
  
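5. Why Korean tokenization is difficult
  - Korean is an agglutinative language, so whitespace-based tokenization alone is insufficient and morphological analysis is required.
  - Morphological analyzers (KoNLPy's Okt, Komoran, Hannanum, Mecab) are commonly used.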
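
The two Penn Treebank conventions mentioned in item 3 (hyphenated words stay whole, apostrophe clitics are split off) can be illustrated with a small regex sketch. This is a simplified approximation written for this note, not NLTK's actual implementation; the function name `treebank_like_tokenize` and its rules are assumptions chosen for illustration, and it ignores cases such as abbreviations ("Mr.") and quotation marks.

```python
import re

def treebank_like_tokenize(text):
    """Rough approximation of two Penn Treebank conventions (illustrative only).

    - hyphenated words stay together ("home-based")
    - clitics are split off ("doesn't" -> "does", "n't"; "Jone's" -> "Jone", "'s")
    """
    # Split "n't" off the preceding word: "doesn't" -> "does n't"
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Split other common clitics: 's, 're, 've, 'll, 'd, 'm
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # Tokens: clitics, words (optionally hyphenated), or single punctuation marks
    return re.findall(r"n't|'\w+|\w+(?:-\w+)*|[^\w\s]", text)

print(treebank_like_tokenize("It doesn't have a home-based restaurant."))
print(treebank_like_tokenize("Jone's Orphanage"))
```

For real work, NLTK's own tokenizers cover far more cases than this sketch does.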

Word and sentence tokenization with NLTK, Keras, KoNLPy, and KSS

In [1]:
!pip install nltk kss konlpy tensorflow


In [3]:
import nltk
from nltk.tokenize import word_tokenize, WordPunctTokenizer, sent_tokenize
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from konlpy.tag import Okt

In [9]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [10]:
# Word tokenization (English)
text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."

In [14]:
print("word_tokenize:", word_tokenize(text))  # Treebank-style rules: splits clitics ("Do", "n't") and keeps "Mr." intact
print("WordPunctTokenizer:", WordPunctTokenizer().tokenize(text))  # splits every punctuation mark into its own token
print("text_to_word_sequence:", text_to_word_sequence(text))  # lowercases and strips punctuation but keeps apostrophes

word_tokenize: ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
WordPunctTokenizer: ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']
text_to_word_sequence: ["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']


Sentence tokenization

In [15]:
text = "His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff."

In [17]:
for sentence in sent_tokenize(text):  # splits at sentence-final periods and leaves periods inside a sentence alone
  print(sentence)

His barber kept his word.
But keeping such a huge secret to himself was driving him crazy.
Finally, the barber went up a mountain and almost to the edge of a cliff.


In [18]:
text = 'I am actively looking for Ph.D. students. And you are a Ph.D student.'
for sentence in sent_tokenize(text):  # the period inside "Ph.D." does not end the sentence
  print(sentence)

I am actively looking for Ph.D. students.
And you are a Ph.D student.
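
To see why abbreviation handling matters, the sketch below compares a naive regex splitter with a splitter that consults an abbreviation list. It is only an illustration and does not reflect how sent_tokenize works internally; the names `naive_split`, `abbreviation_aware_split`, and the tiny `ABBREVIATIONS` set are hypothetical.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for illustration only.
ABBREVIATIONS = {"Ph.D.", "Mr.", "Dr."}

def naive_split(text):
    # Break after every ., ! or ? that is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text):
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        # End a sentence only if the chunk ends with terminal punctuation
        # and is not a known abbreviation.
        if chunk[-1] in ".!?" and chunk not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without terminal punctuation
        sentences.append(" ".join(current))
    return sentences

text = "I am actively looking for Ph.D. students. And you are a Ph.D student."
print(naive_split(text))               # breaks the first sentence after "Ph.D."
print(abbreviation_aware_split(text))  # keeps "Ph.D. students." together
```

The list-based approach fails when an abbreviation really does end a sentence, which is one reason a trained splitter is usually preferable.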


Korean

In [19]:
okt = Okt()
text = '자연어 처리는 어렵지만 재미있습니다.'
print("Morpheme tokenization:", okt.morphs(text))
print("Noun extraction:", okt.nouns(text))
print("POS tagging:", okt.pos(text))

Morpheme tokenization: ['자연어', '처리', '는', '어렵지만', '재미있습니다', '.']
Noun extraction: ['자연어', '처리']
POS tagging: [('자연어', 'Noun'), ('처리', 'Noun'), ('는', 'Josa'), ('어렵지만', 'Adjective'), ('재미있습니다', 'Adjective'), ('.', 'Punctuation')]


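  - Cleaning
    - Removes noise from the text.
    - Noise: meaningless special characters, repeated whitespace, unnecessary words, and so on.
    - Typical steps: removing stopwords and dropping low-frequency words.
  - Normalization
    - Merges words that mean the same thing but are written differently.
    - Methods
      - Case folding: "KOREA", "Korea", "korea"
      - Unifying variant forms: "KOREA" and "KOR"; "USA" and "US"
      - Stemming and lemmatization: reduce a word to its base form, e.g. "running" -> "run"
  - Removing unnecessary words
    - Low-frequency words
    - Short words (1-2 characters)
  - Using regular expressions (regex)
    - Remove specific patterns (HTML, dates, special symbols).
    - Handle patterns that recur throughout text cleaning.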
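
The cleaning steps listed above can be sketched with only the standard library. This is a minimal sketch under stated assumptions: the function names `clean` and `drop_rare_and_short`, the thresholds, and the sample sentence are all made up for illustration, and the date pattern only covers the YYYY-MM-DD form.

```python
import re
from collections import Counter

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    text = re.sub(r"\d{4}-\d{2}-\d{2}", " ", text)  # drop YYYY-MM-DD dates
    text = text.lower()                             # case folding
    text = re.sub(r"[^a-z\s]", " ", text)           # drop digits and symbols
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace

def drop_rare_and_short(tokens, min_count=2, min_len=3):
    # Keep tokens that occur at least min_count times and have at least min_len characters.
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count and len(t) >= min_len]

raw = "<p>KOREA  won on 2025-02-24!</p> Korea is happy, korea is proud."
cleaned = clean(raw)
print(cleaned)
print(drop_rare_and_short(cleaned.split()))
```

Case folding is what lets "KOREA", "Korea", and "korea" count as the same word in the frequency filter.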


Cleaning and normalization functions

In [21]:
import re
from collections import Counter  # word frequencies, used to drop rare words
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords  # stopword lists
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

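Stemming:
  - Strips affixes (prefixes, suffixes) from a word using rules.
  - Can distort the meaning; the result may not be a real word.
  - going: go
  - flies: fli (not a word)
  - happily: happili (not a word)
  - running: run
Lemmatization:
  - Uses grammatical information (part of speech) to map a word to its dictionary base form.
  - Looks the word up in a dictionary to find its lemma.
  - Requires more computation than stemming.
  - going: go
  - flies: fly
  - happily: happy (idealized; NLTK's WordNetLemmatizer returns adverbs such as "happily" unchanged)
  - running: run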

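Stemming vs. lemmatization
- Approach: stemming strips affixes with simple rules; lemmatization converts via a dictionary.
- Speed: stemming is fast; lemmatization is slower.
- Accuracy: stemming is lower (meaning can change); lemmatization is higher (grammatical meaning is preserved).

- Fast text analysis: stemming
- Preserving exact meaning: lemmatization
- Performance on large datasets: stemming
- Models where linguistic accuracy matters: lemmatization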
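
The contrast between rule-based and dictionary-based reduction can be shown with a toy example. The suffix rules and the lemma dictionary below are invented for illustration; they are not the Porter algorithm or WordNet, only a sketch of the two styles.

```python
# Toy suffix rules, tried in order; NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ly", "li"), ("s", "")]

# Toy lemma dictionary standing in for a real lexical resource.
LEMMA_DICT = {"flies": "fly", "going": "go", "running": "run", "was": "be"}

def toy_stem(word):
    # Apply the first matching suffix rule without checking that the result is a real word.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMA_DICT.get(word, word)

words = ["flies", "going", "happily", "was"]
print([toy_stem(w) for w in words])
print([toy_lemmatize(w) for w in words])
```

The stemmer is a few string operations per word but produces non-words such as "fli" and "wa", while the lookup returns real base forms only for words its dictionary covers.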



In [23]:
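# Cleaning and normalization (case, special characters, stopwords and short words,
# stemming and lemmatization)
def clean_text(text):
  # Case folding
  text = text.lower()
  # Remove everything except letters and whitespace
  text = re.sub(r'[^a-zA-Z\s]', '', text)
  # Tokenize, then drop stopwords and words of 2 characters or fewer
  words = word_tokenize(text)
  stop_words = set(stopwords.words('english'))  # load the English stopword list
  words = [word for word in words if word not in stop_words and len(word) > 2]
  # Normalize word forms (stemming and lemmatization)
  stemmer = PorterStemmer()  # stemmer
  lemmatizer = WordNetLemmatizer()  # lemmatizer; defaults to pos='n', so verb forms such as "wondering" stay unchanged
  stemmed_words = [stemmer.stem(word) for word in words]
  lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

  return ' '.join(stemmed_words), ' '.join(lemmatized_words)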


In [28]:
test_text = "I was wondering if anyone out there could enlighten me on this car. US and USA are similar. The automobile is expensive!"
# Result after cleaning
stemmed_text, lemmatized_text = clean_text(test_text)

In [29]:
print(stemmed_text)
print(lemmatized_text)

wonder anyon could enlighten car usa similar automobil expens
wondering anyone could enlighten car usa similar automobile expensive


Stopwords (stop_words)
  - Words that occur frequently but contribute little to analyzing meaning.
  - English: i, my, me, the, is, an, over
  - Korean: particles (를, 에서), conjunctions (그리고, 그러나), determiners (이, 그, 저)

In [31]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Inspect the stopword list
stop_words_list = stopwords.words("english")
stop_words_list[:10]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

In [39]:
# Stopword removal with NLTK
text = "Family is not an important thing. It's everything."
stop_words = stopwords.words("english")
stop_words = set(stop_words + ['.', "'s"])  # add custom stopwords
# Tokenize the sentence
word_tokens = word_tokenize(text)

# The comparison is case-sensitive, so "It" survives even though "it" is a stopword
result = [word for word in word_tokens if word not in stop_words]

print('Before removal:', word_tokens)
print('After removal:', result)

Before removal: ['Family', 'is', 'not', 'an', 'important', 'thing', '.', 'It', "'s", 'everything', '.']
After removal: ['Family', 'important', 'thing', 'It', 'everything']


In [40]:
clean_text(text)

('famili import thing everyth', 'family important thing everything')

https://github.com/stopwords-iso/stopwords-ko?ref=deep.chulgil.me

In [42]:
!npm install stopwords-ko

added 1 package in 2s

In [43]:
!wget "https://registry.npmjs.org/stopwords-ko/-/stopwords-ko-0.2.0.tgz"

--2025-02-24 11:37:56--  https://registry.npmjs.org/stopwords-ko/-/stopwords-ko-0.2.0.tgz
Resolving registry.npmjs.org (registry.npmjs.org)... 104.16.25.34, 104.16.0.35, 104.16.1.35, ...
Connecting to registry.npmjs.org (registry.npmjs.org)|104.16.25.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5166 (5.0K) [application/octet-stream]
Saving to: ‘stopwords-ko-0.2.0.tgz’


2025-02-24 11:37:56 (49.1 MB/s) - ‘stopwords-ko-0.2.0.tgz’ saved [5166/5166]



In [44]:
!tar -zxvf stopwords-ko-0.2.0.tgz

package/package.json
package/.npmignore
package/README.md
package/LICENSE
package/bower.json
package/stopwords-ko.json
package/.travis.yml


In [55]:
import json
with open('/content/package/stopwords-ko.json',encoding='utf-8') as f:
  korean_stop_words = json.load(f)

In [56]:
len(korean_stop_words), korean_stop_words[100:110]

(679,
 ['그러므로',
  '그러한즉',
  '그런 까닭에',
  '그런데',
  '그런즉',
  '그럼',
  '그럼에도 불구하고',
  '그렇게 함으로써',
  '그렇지',
  '그렇지 않다면'])

In [59]:
# Korean stopword removal with Okt,
# a Korean morphological analyzer
okt = Okt()
example = '화사, 박나래와 "전화 차단" 선언 후 "큰엄마 같아, 사랑해"'

korean_stop_words.append('후')  # '후' ("after") is not in the downloaded list, so add it

# Tokenize (nouns only)
okt_nouns = okt.nouns(example)
# Remove stopwords
result = [word for word in okt_nouns if word not in korean_stop_words]
print("Before stopword removal:", okt_nouns)
print("After stopword removal:", result)

Before stopword removal: ['화사', '박나래', '전화', '차단', '선언', '후', '엄마', '사랑']
After stopword removal: ['화사', '박나래', '전화', '차단', '선언', '엄마', '사랑']


In [58]:
'후' in korean_stop_words

False

Normalization with regular expressions

In [64]:
import re
text = 'Natural Language Processing is amazing'
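tokens = re.findall(r"\b\w+\b", text)  # extract runs of word characters between word boundaries
print(tokens)

# Split on runs of non-word characters (whitespace, punctuation);
# a trailing delimiter leaves an empty string at the end of the result
text = 'Hello, world! NLP is fun.'
tokens = re.split(r"\W+", text)
print(tokens)

# Remove special characters (the result is a string, not a token list)
text = 'Hello, world! NLP is fun.'
tokens = re.sub(r"[^a-zA-Z0-9\s]", "", text)
print(tokens)

# Korean: keep only runs of Hangul
text = '안녕하세요 Hello 1234 반갑습니다'
tokens = re.findall(r"[ㄱ-ㅎ가-힣]+", text)
print(tokens)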

['Natural', 'Language', 'Processing', 'is', 'amazing']
['Hello', 'world', 'NLP', 'is', 'fun', '']
Hello world NLP is fun
['안녕하세요', '반갑습니다']
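
The trailing empty string in the re.split output above comes from the final period acting as a delimiter. A small sketch of two ways to avoid it, plus a reminder that `\w` in Python 3 matches Unicode word characters by default (so Hangul is kept alongside Latin letters and digits):

```python
import re

text = "Hello, world! NLP is fun."

# re.split leaves an empty string when the text ends with a delimiter.
parts = re.split(r"\W+", text)
tokens = [p for p in parts if p]  # drop empty strings

# re.findall with a "+" pattern never returns empty matches.
assert tokens == re.findall(r"\w+", text)
print(tokens)

# \w is Unicode-aware for str patterns, so mixed-script text is tokenized in one pass.
mixed = "안녕하세요 Hello 1234 반갑습니다"
print(re.findall(r"\w+", mixed))
```

Use the explicit Hangul character class when only Korean tokens are wanted, and `\w+` when all scripts should be kept.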


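Integer encoding: tokenize at the word level, then map each word to an integer
   - English: words are separated by whitespace and inflection is relatively simple, so the usual NLP preprocessing pipeline works well.
   - Whitespace-based word tokenization: word tokenizer
   - Stopword removal: stop_words
   - Sort words by frequency and assign integer indices in that order.
   - Add an OOV token to handle the OOV (out-of-vocabulary) problem.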
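
The steps above can be sketched on a tiny corpus. The reserved indices (0 for padding, 1 for OOV) are an assumed convention, and `build_vocab` and `encode` are hypothetical helper names used only for this illustration.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed convention for reserved indices

def build_vocab(tokens, max_size):
    # Most frequent words get the smallest indices, starting after the reserved ones.
    counts = Counter(tokens)
    return {word: idx + 2 for idx, (word, _) in enumerate(counts.most_common(max_size))}

def encode(tokens, word_to_index):
    # Words missing from the vocabulary map to the OOV index.
    return [word_to_index.get(tok, OOV) for tok in tokens]

corpus = "the war the talks the war peace".split()
word_to_index = build_vocab(corpus, max_size=2)
print(word_to_index)
print(encode("the peace talks war".split(), word_to_index))
```

Capping the vocabulary at `max_size` is what creates out-of-vocabulary words in the first place, so the cap and the OOV index are usually chosen together.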

In [74]:
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = '''
The leaders of Europe’s two nuclear powers are rushing to the White House to try to reclaim a central role for themselves and for Ukraine after they were cut out of US-Russia talks on ending the war.

President Donald Trump sent shock waves through the transatlantic alliance last week and played into Russian President Vladimir Putin’s hands while attacking Ukrainian President Volodymyr Zelensky and trashing the truth about how the war started.

French President Emmanuel Macron will visit Trump on Monday — the third anniversary of Russia’s brutal invasion of Ukraine, a sovereign democracy, which has killed tens of thousands of civilians and left Putin and his forces accused of war crimes.

British Prime Minister Keir Starmer will follow on Thursday, in the most treacherous moment yet of his young premiership, with large gaps opening between Washington and London on the reality of Ukraine’s plight.

Britain and France are drawing up plans for a European “reassurance force,” perhaps including up to 30,000 troops that could deploy to Ukraine in the event of a peace deal. The idea, however, faces massive barriers — not least that a deal that both Zelensky and Putin could agree to sign seems highly unlikely. And Starmer has already warned that the force couldn’t work without a US “backstop,” which could potentially include security guarantees, American intelligence cooperation, air support and heavy lift transport. A key takeaway this week will be whether Trump has any interest given Russia’s opposition to NATO troops in Ukraine under any flag.

As Trump leads the US in a new direction on Ukraine, historic schisms are opening that threaten the transatlantic alliance and the post-World War II order. Trump treats America’s longtime friends — who have failed to deliver on calls by successive presidents to spend more on defense — as adversaries. And the new administration has already shattered years of European assumptions about America’s security guarantees to the West.

A Ukrainian soldier prepares to fire a howitzer toward Russian positions on the front line near Chasiv Yar, in Ukraine's Donetsk region, on February 7.
A Ukrainian soldier prepares to fire a howitzer toward Russian positions on the front line near Chasiv Yar, in Ukraine's Donetsk region, on February 7. Oleg Petrasiuk/Ukraine's 24th Mechanized Brigade/AP
The president’s siding with Putin over Zelensky — and his bid to extract a punitive deal to export Ukraine’s rare earth minerals as a payback for past US aid — shocked transatlantic allies. Defense Secretary Pete Hegseth’s warning in Brussels this month that Europe must take primary responsibility for its own security called into question NATO’s creed of mutual defense. And Vice President JD Vance’s slam of European governments and values in a speech in Munich was seen in Europe as an attempt to destabilize continental leaders on behalf of far-right populists who take ideological inspiration from the MAGA movement.

In an extraordinary comment that captured the historic times, the likely next leader of Germany Friedrich Merz, leader of the conservative Christian Democratic Union that won Sunday’s general election, according to exit polls, set out the new government’s program.

“My absolute priority will be to strengthen Europe as quickly as possible so that, step by step, we can really achieve independence from the USA,” Merz said at a televised roundtable after the exit polls also showed huge gains for the extreme-right AfD party.

“I would never have believed that I would have to say something like that on television. But at the very least, after Donald Trump’s statements last week, it is clear that the Americans — at least this part of the Americans in this administration — are largely indifferent to the fate of Europe,” Merz said.

‘We came very, very close to signing something’
Trump appears to be aiming for a lightning-fast peace agreement — similar to the velocity of his domestic transformation a month after his return to the White House.

His Middle East envoy Steve Witkoff, who has a leading role in the Ukraine talks, on Sunday raised the prospect of a swift breakthrough following the meeting between US and Russian officials last week. “We came very, very close to signing something. And I think we will be using that framework as a guidepost to get a peace deal done between Ukraine and Russia,” Witkoff told CNN’s Jake Tapper on “State of the Union.”

He added: “The president understands how to get deals done. Deals only work when they’re good for all the parties. And that’s the pathway that we’re on here.”

Hegseth implied Sunday that Trump’s depiction of Zelensky as a “dictator” last week was meant to avoid annoying Putin in order to get concessions at the negotiating table. “Standing here and saying, ‘you’re good, you’re bad, you’re a dictator, you’re not a dictator, you invaded, you didn’t’ — it’s not useful. It’s not productive,” Hegseth said on “Fox News Sunday.”

Delegations from the United States, left, and Russia, right, meet in Riyadh, Saudi Arabia, on February 18. US Secretary of State Marco Rubio is seen second from left, between Middle East envoy Steve Witkoff, far left, and national security adviser Mike Waltz. Russian Foreign Minister Sergei Lavrov is seen on the far right, next to foreign policy adviser Yuri Ushakov.
Delegations from the United States, left, and Russia, right, meet in Riyadh, Saudi Arabia, on February 18. US Secretary of State Marco Rubio is seen second from left, between Middle East envoy Steve Witkoff, far left, and national security adviser Mike Waltz. Russian Foreign Minister Sergei Lavrov is seen on the far right, next to foreign policy adviser Yuri Ushakov. Evelyn Hockstein/Reuters
But Sen. Jack Reed, the ranking Democrat on the Senate Armed Services Committee, accused Trump of “surrendering” to Putin. “This is not a statesman or a diplomat. This is just someone who admires Putin, does not believe in the struggle of the Ukrainians, and is committed to cozying up to an autocrat.” The Rhode Island Democrat added on ABC’s “This Week”: “Putin will not stop in Ukraine. He will begin in a campaign, both clandestine and in many cases overt, to undermine the other governments in Eastern Europe and it’ll create chaos.”

Trump’s turn against Ukraine and his rush to embrace Putin ahead of a potential summit in the coming weeks has Ukrainians and Europeans fearing that he simply plans to seal a deal with Russia and then impose it on Kyiv. That’s why Macron and Starmer will try to convince the president he will look bad if he fails to drive a hard bargain with Putin.

“What I am going to do is that I am going to tell him basically, you cannot be weak in the face of President Putin. It’s not you, it’s not your trademark,” Macron said, paraphrasing his message to Trump in a social media Q&A on Thursday.

An Elysée Palace official said that Macron shared Trump’s goal of ending Russia’s war of aggression and was bringing proposals that were reaffirmed in his talks with European leaders, particularly with the British. “He is traveling to Washington with this goal in mind, sharing this desire to end the conflict while making every effort to maintain our support for Ukraine, strengthen European security and ensure that Ukraine is fully involved in these efforts, and to ensure that Ukraine’s interests — which are ours as well — are fully taken into account.”

Starmer on Sunday laid out a tough pro-Zelensky approach, which conflicted with Trump’s position, a day after talking to the Ukrainian leader on the phone. “Nobody wants the bloodshed to continue. Nobody, least of all the Ukrainians,” Starmer said at the Scottish Labour Party conference in Glasgow. “But after everything that they have suffered, after everything that they have fought for, there could be no discussion about Ukraine without Ukraine, and the people of Ukraine must have a long-term secure future.”

Visitors stand next to a makeshift memorial paying tribute to Ukrainian and foreign fighters at Independence Square in Kyiv ahead of the third anniversary of Russia's invasion of Ukraine.
Visitors stand next to a makeshift memorial paying tribute to Ukrainian and foreign fighters at Independence Square in Kyiv ahead of the third anniversary of Russia's invasion of Ukraine. Roman Pilipey/AFP/Getty Images
The phrase “no discussion about Ukraine without Ukraine” encapsulated the principles of the Biden administration’s tight coordination with Europe and Kyiv over the war. But that consensus has been buckled by Trump. And Starmer will risk further angering the president before he arrives. Bridget Phillipson, a British Cabinet minister, told Sky News on Sunday that the UK government would unveil a new set of sanctions against Russia on Monday.

Trump says that Zelensky — a hero in the west for leading Ukraine’s resistance to Russia — didn’t deserve to be at the talks. “I’ve been watching for years, and I’ve been watching him negotiate with no cards. He has no cards. And you get sick of it,” Trump said on Fox News Radio’s “The Brian Kilmeade Show” Friday.

Trump also criticized his visitors. “You know, they haven’t done anything,” he told Kilmeade. “You know, Macron’s a friend of mine, and I met with the prime minister, and you know, he’s a very nice guy, but nobody’s done anything,” Trump said.

Macron may try to correct Trump on that point, one person familiar with the matter said. But the French president is most intent on managing the way forward, providing his view on how Europe can help assure Ukraine’s security, as long as it is incorporated into talks to end the war.

Starmer: The US is ‘right’ to complain about European defense spending
The French and British leaders will also arrive in Washington as Trump demands steep hikes in defense spending by NATO members, which would mean excruciating fiscal choices for governments saddled by constricted public finances. Both Macron and Starmer have spoken of the need for European nations to do more to protect the continent, but their capacity to act is likely to fall far short of the American president’s expectations.

Despite Starmer saying Sunday that Trump was “right” in calling for Europe to step up, Phillipson declined to say, for instance, whether her boss would tell Trump a target date for his government to raise defense spending to 2.5% of GDP. The US president has demanded 5%.

Both Macron and Starmer, who spoke by telephone Sunday, are expected to argue that Washington’s continued presence in Europe and security guarantees are critical to peace in the west, despite the Trump administration’s desire to pivot to the challenge posed by China.

British Prime Minister Keir Starmer speaks during Day 3 of the Scottish Labour Party conference on Sunday in Glasgow.
British Prime Minister Keir Starmer speaks during Day 3 of the Scottish Labour Party conference on Sunday in Glasgow. Peter Summers/Getty Images
The European message is going to be a tough sell to a transactional president who doesn’t appreciate alliances as a force multiplier for American power and who seems to prefer the company of autocrats to that of his fellow democratic leaders.

Macron has already tried to shape Trump’s thinking on Ukraine, arranging a three-way meeting with the then-US president-elect and Zelensky in Paris last December. Trump was respectful and “in listening mode” during the meeting, one official said, as Zelensky laid out the necessity of security guarantees for Ukraine once the war ends. Macron tried to impress on Trump that Putin had changed since he was last in office and warned that if Ukraine was defeated, the US could look weak to its other rivals — namely China.

But two months later, the talks do not appear to have left a lasting impression on Trump, given his comments of the last week. And European officials acknowledge it will be impossible to persuade Trump to abandon his erroneous views of the war, including that it was provoked by Ukraine or that the United States was conned into supporting a man he claims is a dictator.

Instead, they say, it will be more useful to look ahead, as Trump prepares to sit down soon with Putin and the contours of a possible peace agreement emerge.
'''
tokens = word_tokenize(text)
stop_words = stopwords.words('english') + [',', '.', '’', '—', '“', '”']

tokens = [word for word in tokens if word.lower() not in stop_words]
print(tokens)
# Count word frequencies as the basis for integer encoding
vocab = Counter(tokens)
print(vocab.most_common())



In [87]:
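temp = clean_text(text)[1]  # lemmatized text returned by clean_text
new_tokens = word_tokenize(temp)
stop_words = stopwords.words('english') + [',', '.', '’', '—', '“', '”']
new_tokens = [word for word in new_tokens if word.lower() not in stop_words]
# Frequency-based integer encoding
# NOTE: this counts `tokens` from the previous cell, not the cleaned `new_tokens`,
# which is why the output below still contains capitalized words.
# Use Counter(new_tokens) to encode the cleaned text instead.
vocab = Counter(tokens)
print(vocab.most_common(30))
# Indices start at 2, leaving 0 and 1 free (presumably for padding and the OOV token)
word_to_index = {word: index + 2 for index, (word, _) in enumerate(vocab.most_common(30))}
print(word_to_index)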

[('Ukraine', 29), ('Trump', 29), ('Putin', 13), ('Starmer', 12), ('Russia', 11), ('Europe', 10), ('Macron', 10), ('said', 10), ('European', 9), ('US', 9), ('security', 9), ('Sunday', 9), ('war', 8), ('left', 8), ('president', 8), ('last', 7), ('Zelensky', 7), ('talks', 6), ('President', 6), ('week', 6), ('Russian', 6), ('Ukrainian', 6), ('British', 6), ('right', 6), ('leaders', 5), ('Minister', 5), ('could', 5), ('peace', 5), ('deal', 5), ('defense', 5)]
{'Ukraine': 2, 'Trump': 3, 'Putin': 4, 'Starmer': 5, 'Russia': 6, 'Europe': 7, 'Macron': 8, 'said': 9, 'European': 10, 'US': 11, 'security': 12, 'Sunday': 13, 'war': 14, 'left': 15, 'president': 16, 'last': 17, 'Zelensky': 18, 'talks': 19, 'President': 20, 'week': 21, 'Russian': 22, 'Ukrainian': 23, 'British': 24, 'right': 25, 'leaders': 26, 'Minister': 27, 'could': 28, 'peace': 29, 'deal': 30, 'defense': 31}
